A new lawsuit filed in California accuses several leading artificial intelligence companies – including Anthropic, Google, OpenAI, Meta, xAI, and Perplexity – of copyright infringement for training their large language models (LLMs) on illegally obtained books. The suit, brought by a group of authors including journalist John Carreyrou, known for his reporting on Theranos, argues that the AI firms profited from the unauthorized use of copyrighted material. It follows a similar earlier case against Anthropic and highlights ongoing concerns about AI training data and intellectual property rights.
The complaint, filed in the U.S. District Court for the Northern District of California on November 20, 2025, alleges that the defendants systematically copied books from various sources, including “shadow libraries” known for hosting pirated content, to build their AI models. These models, capable of generating human-like text, are at the core of popular products like ChatGPT, Gemini, and Claude. The authors seek financial compensation for the alleged infringement.
The Core of the Dispute: AI Training Data and Copyright
The central issue revolves around the legality of using copyrighted works for AI training, even if those works were obtained illegally. AI models learn by analyzing massive datasets of text and code, identifying patterns and relationships that allow them to generate new content. The quality and scope of this data are crucial to the model’s performance.
This isn’t the first legal challenge of this kind. A class-action lawsuit against Anthropic, settled in September 2025, resulted in a $1.5 billion fund to compensate authors whose works were used in training. The plaintiffs in the new lawsuit, however, were dissatisfied with the terms of that settlement.
Previous Ruling and Its Limitations
The judge overseeing the initial Anthropic case ruled that the use of copyrighted material for training purposes constituted “fair use” under copyright law. This decision hinged on the transformative nature of the AI training process – the models don’t simply reproduce the original works, but rather learn from them to create something new. The judge found, however, that acquiring the copyrighted material from pirate sources was itself unlawful and not shielded by fair use.
The current plaintiffs argue that the settlement doesn’t adequately address the fundamental problem of AI companies benefiting from stolen intellectual property. They contend that the roughly $3,000-per-book payout offered to eligible authors is a paltry sum compared to the billions of dollars generated by the AI services built on their work.
According to the new lawsuit, the proposed settlement “seems to serve [the AI companies], not creators.” The authors believe the settlement allows AI companies to avoid full accountability for their actions and sets a dangerous precedent for the industry.
Why This Lawsuit Matters: Implications for Authors and AI Development
The outcome of this case could have significant ramifications for both authors and the rapidly evolving field of artificial intelligence. A ruling in favor of the plaintiffs could force AI companies to drastically alter their data acquisition practices and potentially pay substantial royalties to copyright holders. This could increase the cost of developing and deploying AI models.
Conversely, a ruling upholding the “fair use” doctrine, similar to the previous case, would likely embolden AI companies to continue using large datasets of copyrighted material for training, potentially without seeking permission or compensation. This could further exacerbate concerns about the erosion of copyright protections in the digital age.
The lawsuit also raises questions about the responsibility of AI companies to ensure the legality of their data sources. While scraping publicly available data is common practice, the use of known piracy sites introduces a clear legal risk. The plaintiffs argue that the defendants were aware of the illicit origins of much of the training data.
Furthermore, the case touches upon the broader debate surrounding the ethical implications of AI. Many creators feel that their work is being exploited without proper recognition or reward, leading to a sense of unfairness and potentially stifling creativity. The issue of generative AI and its reliance on existing content is a key component of this debate.
The plaintiffs are seeking a jury trial and damages to be determined at trial. They are also requesting an injunction to prevent the defendants from continuing to use illegally obtained books in their AI training processes. The lawsuit specifically targets the use of copyrighted material in the development of models like GPT-4, Gemini, and Llama 2.
Meanwhile, the AI companies named in the suit have not yet filed formal responses. Industry analysts expect them to argue that their use of copyrighted material falls under fair use and that requiring licenses for all training data would be impractical and would stifle innovation. They may also point to the transformative nature of their technology, arguing that it creates new value from existing works.
The legal landscape surrounding AI and copyright is still largely uncharted territory. This case, along with others like it, will help to define the boundaries of fair use and establish the rights and responsibilities of both AI developers and content creators. The courts will need to balance the interests of fostering innovation with the need to protect intellectual property rights.
The next step in the legal process will likely involve the defendants filing motions to dismiss the lawsuit or seeking summary judgment. A hearing date has not yet been set. Observers will be closely watching how the court addresses the complex legal and ethical issues raised by this case, as its decision could shape the future of AI development and copyright law for years to come.

