District court holds that Meta’s downloading of books from online “shadow libraries” and use of such books to train its Llama large language models constitutes fair use, but endorses “market dilution” theory of harm as potential path to undercut fair use defense in future cases involving generative AI.
Plaintiffs, a group of 13 published authors, including Sarah Silverman, Rachel Louise Snyder, Junot Diaz and Ta-Nehisi Coates, sued Meta, alleging, among other claims, that Meta violated the Digital Millennium Copyright Act and committed copyright infringement by training its Llama large language models (LLMs) on their works without permission.
Plaintiffs moved for partial summary judgment, arguing that they had made out a prima facie case for copyright infringement and that Meta’s conduct did not constitute fair use. Meta filed a cross-motion for summary judgment on the ground that its reproduction was fair use as a matter of law. The district court agreed with Meta.
The record on summary judgment showed that Meta had downloaded the books it used for training purposes from “shadow libraries,” or online repositories that offer books and other media for free download, regardless of whether they are copyrighted. Among other things, Meta had downloaded the Library Genesis (LibGen) database as well as Anna’s Archive, a compilation of shadow libraries including LibGen, Z-Library and others.
To download these libraries (which contained millions of works) more quickly and without slowing down its networks, Meta “torrented” them, using a file-sharing technique that entails the simultaneous distribution of small portions of a larger file from many different sources. The parties disputed, however, whether Meta had also uploaded the files that it downloaded to other users’ computer systems during or after the torrenting process.
Meta then added the books it had downloaded to the datasets it used to train the Llama LLMs, but it post-trained the LLMs to prevent them from “memorizing” and outputting certain text from their training data. As a result, neither Meta’s expert witness nor the authors’ expert witness could get the Llama LLMs to generate more than 50 words and punctuation marks from plaintiffs’ books.
In its fair use analysis, the court first addressed the purpose and character of the use and the extent to which the secondary use was transformative. The court contrasted the purpose of Meta’s copying in order to train its LLMs, which are innovative tools that can be used to generate diverse text and perform a wide range of functions, with the purpose of plaintiffs’ books, which were written to be read for entertainment or education, and held that Meta’s use was transformative. In so holding, the court contrasted the way a human reads a book with an LLM’s consumption of a book by ingesting text to learn “statistical patterns” of how words are used together in different contexts and updating its general understanding of language. The court similarly rejected plaintiffs’ argument that Meta’s use merely amounts to a “repackaging” of their books, holding that there was no evidence the LLMs could reproduce substantial portions of the works or that Meta developed Llama with the purpose of enabling it to create books that compete with those of plaintiffs. As the court stated, the evidence showed that Meta, at most, “wanted Llama to be able to generate text in certain styles,” but “style is not copyrightable—only expression is.”
As to the commercial nature of the use—a matter also relevant to the first fair use factor—the court held that while the profit Meta stands to gain from its development of a product trained on plaintiffs’ works is relevant to the fair use analysis overall, it does not tilt the first factor in plaintiffs’ favor. “Commercialism,” the court noted, “tends to be less important when the secondary use is highly transformative,” as it was here.
The fact that Meta downloaded the books from unauthorized “shadow libraries” also did not tilt the first factor in plaintiffs’ favor, in the court’s view. The use of such libraries could be relevant to the “character” of Meta’s use and weigh against fair use, the court observed, “if it benefitted those who created the libraries and thus supported and perpetuated their unauthorized copying and distribution of copyrighted works” through the torrenting process. But, as the court noted, plaintiffs did not submit sufficient evidence of this.
The court also refused to consider Meta’s downloading of plaintiffs’ works as a distinct use wholly separate from the training of its LLMs on those works, stating that “[b]ecause Meta’s ultimate use of the plaintiffs’ books was transformative, so too was Meta’s downloading of those books.” Although plaintiffs argued that Meta had downloaded pirated datasets that were never used for training, the court held that plaintiffs provided no evidence to substantiate this claim. “In any event,” the court stated, “even if Meta did download some copies that weren’t ultimately used for training, fair use doesn’t require that the secondary user make the lowest number of copies possible.”
The court next analyzed the second factor—the nature of the copyrighted works—and held that it weighed in favor of plaintiffs because their novels, memoirs and plays are highly expressive works. Meta argued that it used plaintiffs’ books only to gain access to their “functional elements.” But the court rejected Meta’s attempt to analogize to “intermediate copying” cases in the Ninth Circuit, which involved video game companies that copied video game console manufacturers’ copyrighted code and reverse-engineered it to understand its functional elements.
The court similarly rejected Meta’s attempt to analogize to the Second Circuit’s holding in Authors Guild v. Google, Inc., involving the Google Books database, which allowed users to see which books in the database contained their search terms and whose functionality did not depend on the type of content ingested. Unlike in those cases, here the LLMs’ training depended on the specific word order, word choice, grammar and syntax, which are products of plaintiffs’ creative expression.
As to the third factor—the amount and substantiality of the copyrighted works used—the court held that it favored Meta. Even though Meta had copied plaintiffs’ books in their entirety, the amount that it copied was reasonable given its relationship to the transformative purpose, the court held. Because the quality of the LLMs depended on the quality of the material used to train them, it was reasonably necessary for Meta to make use of the entirety of plaintiffs’ works.
The court then turned to the fourth fair use factor, the effect of the use on the potential market for or value of the copyrighted works, which it deemed “the single most important element of fair use.” It considered three potential theories of market harm: (1) that Llama could reproduce snippets of plaintiffs’ books, thereby allowing users to access those works, or substitutes for them, for free via the model; (2) that Meta’s use harmed the market for licensing books for AI training; and (3) that Llama can generate works that are similar enough (in subject matter or genre) that they will compete with the originals and thereby indirectly substitute for them.
The court rejected the first theory as contrary to the evidence because Llama does not allow users to generate any meaningful portion of plaintiffs’ books. The court reasoned that Llama’s ability to regurgitate minuscule portions of plaintiffs’ books if manipulated into doing so does not threaten to have a meaningful or significant effect on the potential market for or value of plaintiffs’ books.
The court also rejected plaintiffs’ second (and primary) theory of market harm—that Meta’s unauthorized use of their books for LLM training harms the market for licensing their books for that purpose—because any harm that plaintiffs may suffer from the loss of fees paid to license a work for a transformative purpose is not cognizable under the fourth fair use factor.
Finally, the court held that plaintiffs had failed to adduce any evidence demonstrating that using copyrighted books to train an LLM might harm the market for those works by enabling the rapid generation of competing works (i.e., by diluting the market for those works). In response to Meta’s motion for summary judgment, plaintiffs failed to proffer evidence that Meta’s use of their books to train Llama had harmed book sales or that Llama was even capable of generating such competing books to begin with.
Nonetheless, the court endorsed this as a potentially legitimate theory of harm on the fourth fair use factor in cases where plaintiffs develop a sufficient record of such harm. The court expressed concern that even if AI systems did not output material that was substantially similar to any of the works on which they were trained, their ability to “generate countless competing works with a miniscule fraction of the time and creativity it would otherwise take” would “severely undermine the incentive for human beings to create,” which, according to the court, is a “harm that copyright aims to prevent.” It noted that particular types of work—such as news articles—might be more vulnerable to this kind of market dilution.
The court cautioned that a market dilution theory of copyright harm had “never made a difference in a case before,” but nevertheless expressed the view that it “will often cause plaintiffs to decisively win the fourth [fair use] factor—and thus win the fair use question overall[.]” But because the authors suing Meta did not substantiate this theory with an evidentiary record, the court stated that it had “no choice” but to grant summary judgment for Meta.
The court concluded by clarifying that the scope and consequences of its ruling are limited, noting that “this ruling does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful. It stands only for the proposition that these plaintiffs made the wrong arguments and failed to develop a record in support of the right one.”
Summary prepared by Frank D’Angelo and David Forrest