Databricks LLM Copyright Suit Survives as Authors Allege Pirated Database Training
A federal judge has declined to dismiss a class action lawsuit against Databricks, allowing the case to proceed on claims that the company's large language model was trained on a database containing pirated versions of approximately 196,000 book titles, including copyrighted works by the authors bringing the case. The ruling marks a significant setback for Databricks' efforts to escape the litigation, with the court demanding additional evidence from both sides to clarify the scope of the alleged infringement.
The plaintiffs allege that Databricks acquired or deployed an LLM built using the "Books3" dataset, a compilation that reportedly included unauthorized copies of literary works. The authors contend that using this data for training infringed their copyrights, a claim that has drawn increasing judicial scrutiny as the case moves forward. The presiding judge has repeatedly pressed the parties for more detailed information, suggesting the court is not yet satisfied with the record as it stands. Databricks had sought dismissal, arguing the case lacked sufficient factual basis, but that motion has now been denied.
The litigation underscores mounting legal exposure for AI companies over how training data is sourced. If the case proceeds to discovery or trial, it could force greater transparency around the datasets used to develop commercial language models. For the book authors, the stakes include not only individual compensation but a broader precedent that could reshape how AI firms approach content licensing. Courts have yet to definitively rule on whether training an AI model on copyrighted material constitutes infringement, leaving the outcome of this and similar cases with potentially sweeping implications for the sector.