Two recent district court opinions from the Northern District of California, filed within days of each other, address the use of copyrighted material in the training data of two market-dominating Large Language Models (LLMs). These decisions continue to shape the guardrails for incorporating copyrighted content into AI training, offering valuable insights for both AI developers and deployers.
Via slightly different paths, both decisions reach the conclusion that using copyrighted material to train large language models can be a fair use that does not result in copyright infringement. Together they offer insights into when reproducing entire works for training purposes might constitute fair use, and when it likely does not. Below is an overview of the key takeaways from these decisions, along with guidance on how industry players might respond.
1. Use of Copyrighted Materials for Training Is Transformative
Although the two courts differ slightly in their analysis, on one thing they agree: using copyrighted content to train Large Language Models is “highly” or “spectacularly” transformative.
In both cases, the relevant facts showed that the AI developers copied entire books not to distribute or display them, but to extract statistical patterns and relationships between words, enabling the AI to generate new, original text in response to user prompts. Because this purpose was fundamentally different from the original purpose of the books (i.e., education or entertainment), and the purpose and character of the copying was in service of that distinct purpose, each court concluded that the use could be justified as transformative, weighing in favor of the defendants and their fair use arguments.
2. Can Market Harm Considerations Outweigh That Transformative Use?
Where the courts diverge is in their application of the fourth fair use factor under copyright law: the impact that this transformative use has on the market for the original copyrighted works.
In Bartz v. Anthropic, the court found no actionable market harm from the use of the copyrighted materials to train the AI. The court explained that “the copies used to train specific LLMs did not and will not displace demand for copies of Authors' works, or not in the way that counts under the Copyright Act.” The court likened the authors’ complaint to arguing that using books for the purpose of “training schoolchildren to write well would result in an explosion of competing works,” and concluded, “[t]his is not the kind of competitive or creative displacement that concerns the Copyright Act.” The court also rejected the argument that the loss of a potential licensing market for AI training was cognizable market harm, stating that “such a market for that use is not one the Copyright Act entitles Authors to exploit.”
In the second case, the Kadrey plaintiffs made the same argument as the Bartz plaintiffs regarding the potential licensing market, and the Kadrey court agreed with the Bartz court that the loss of that potential market is not cognizable harm. But where the Bartz court dismisses the potential for competition in the marketplace, the Kadrey court steps back from the individual fair use factors and considers the question of fair use and AI training more holistically. The Kadrey court clearly believes that AI, and LLMs in particular, pose a real and viable threat to the market for original creative works, one that could disincentivize authors from creating new works, particularly in the non-fiction genre. As a practical matter, however, the Kadrey court was constrained by the record before it, which simply was not well-developed or well-argued enough to support that conclusion. As a result, the Kadrey court reached the same conclusion as the Bartz court, but did so reluctantly, lamenting the lack of evidence before it.
With respect to this fair use factor, both courts addressed the fact that each of the LLMs at issue was specifically trained and fine-tuned to reduce the risk that its output would itself be infringing. Neither LLM could regurgitate direct copies of any of the copyrighted works used to train it, and both courts specifically referenced that fact in their market harm analyses. The additional takeaway for AI developers is to build in mitigation efforts on this front to minimize copyright infringement risk and maximize potential fair use arguments.
3. Pay Attention to the Data Sources and the Specific Use
Both orders also addressed whether the manner in which AI developers acquire copyrighted works for training (including, for example, whether the data is obtained lawfully or through unauthorized means) affects the application of the fair use doctrine. The answer is “yes, it does,” but we do not yet clearly know “how much.”
The LLM developers in both the Bartz case and the Kadrey case sourced some, if not all, of the copyrighted material used to train their LLMs from unlawful sources. The developer in the Bartz case, for its part, downloaded millions of pirated books from sources like Books3, LibGen, and PiLiMi. That developer used some of the pirated books for training and did not use others, but it retained all unlawful copies in its central library indefinitely. The court determined that the use of such pirated materials was not fair use under the Copyright Act, given the unlawful nature of the materials and the fact that the developer kept materials it never used for training or for any defined lawful purpose. Indeed, the unlawful source of the copyrighted materials “plainly displaced demand for Authors' books—copy for copy,” and the court warned that condoning this practice would “destroy the [entire] publishing market if that were the case.”
Interestingly, the developer in the Kadrey case downloaded many of its own materials from some of the same “shadow libraries,” but that court did not give as much weight to this factor in its analysis. Factually, there was a relevant distinction: the developer in the Kadrey case did not download the materials simply to maintain a central library, as the developer in the Bartz case did, but rather for the specific purpose of training its AI. Perhaps because of this narrower, defined purpose and its highly transformative nature, discussed above, the court did not give as much weight to the unlawful nature of the copies at issue, noting that the legality of the source data was “relevant” but not “determinative.”
Of particular note to those interested in training AI, the developer in the Bartz case had a second data set that it acquired lawfully by purchasing used copies of books on the open market. Again, some of this lawfully purchased material was used for training and some was not, but all copies were ultimately retained in the developer’s central library indefinitely. More importantly, because these copyrighted materials were acquired lawfully, the court determined that their use for training the AI was fair use.
What we can determine from these two decisions is that the circumstances under which copyrighted materials were acquired are relevant to the fair use analysis. What remains to be seen is the weight courts will give those facts.
4. Moving Forward
Both cases are still being litigated, as these orders did not resolve all issues before the courts, and more remains to be seen as these issues develop. For the time being, the rulings clarify that there are potentially lawful avenues for training LLMs using copyrighted materials. The legitimacy of the source of those copyrighted materials appears to be a factor on which the analysis may turn, though its significance remains unclear, and continuing efforts to ensure that AI output is not infringing appear to be relevant. Businesses seeking to use large datasets for LLM training should prioritize comprehensive compliance strategies, including thorough vetting of all data sources to help minimize risk, and express policies and practices to mitigate the possibility of infringing output.
For companies that deploy AI without line of sight into the data sources, there could be downstream liability if the developer lacked the necessary rights to the training data. To mitigate this risk, licensees should address these concerns through robust contract provisions in their licensing agreements.