The past year has turned “fair use” from a sleepy copyright doctrine into a front-page legal debate. As artificial intelligence models learn to write, paint, compose, and converse, they also raise an uncomfortable question: what does it mean to “use” something creatively when that use happens inside a neural network? The law of fair use was built for humans quoting paragraphs, not machines parsing petabytes. But the same statute that once guided artists and publishers now sits at the center of every major lawsuit against large language model developers.
Training an AI system requires exposure to enormous amounts of text, code, images, and other data. These systems demand a diet so vast that it inevitably includes copyrighted material, intentionally or otherwise. In principle, the process is transformative: the model doesn’t simply store or reproduce what it reads; it digests patterns, abstracts syntax, and synthesizes a kind of statistical understanding of language. But in practice, that transformation has limits. Some models can be coaxed into regurgitating long passages of copyrighted books or journalistic works. The line between “learning” and “copying,” once obvious to our high school teachers, becomes blurry when the student is a machine and the reproduction happens by accident (allegedly, of course).
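To make that blurriness concrete, consider how auditors often look for regurgitation: by checking whether a model’s output shares long literal runs of characters with a known text. The sketch below is a toy illustration of that idea, not any lab’s actual test; the strings are placeholders and the 20-character threshold is an arbitrary assumption, not a legal standard.

```python
# Toy check for verbatim overlap between model output and a known text,
# using overlapping character n-grams. Placeholder strings throughout.

def char_ngrams(text: str, n: int) -> set[str]:
    """Collect every overlapping n-character window of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def shares_verbatim_run(output: str, source: str, n: int = 20) -> bool:
    """True if the output reproduces any n-character run from the source."""
    source_grams = char_ngrams(source, n)
    return any(gram in source_grams for gram in char_ngrams(output, n))

source = "It was the best of times, it was the worst of times."  # stand-in
paraphrase = "Times were good and times were bad."
quotation = "As Dickens wrote, it was the worst of times."

print(shares_verbatim_run(paraphrase, source))  # False: no shared 20-char run
print(shares_verbatim_run(quotation, source))   # True: quotes a 20-char run
```

The threshold itself matters less than the asymmetry it exposes: a paraphrase sails through while a direct quotation trips the check, which is roughly the distinction courts are now being asked to draw at scale.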
The fair use doctrine was designed to accommodate exactly this kind of gray zone, though its framers never imagined it operating at this scale. It asks whether a new use adds something genuinely different or transformative, whether it draws from creative or factual works, whether it takes more than is necessary, and whether it harms the market for the original. When it comes to training AI models, all four factors are up for debate. Developers argue that training is fundamentally transformative: the model’s purpose is not to substitute for any single book or dataset, but to learn from the collective shape of language itself. Rightsholders counter that copying millions of works to teach a commercial product is not “transformative” in any intuitive sense, especially when that product can spit out convincing imitations of the original.
Courts are only beginning to test these theories. Early decisions in other domains (scanning books for search indexing, as in the Google Books litigation, or creating thumbnail images for image search) suggest that fair use may tolerate large-scale copying when the purpose is analytical rather than expressive. But large language models complicate that precedent because their outputs are themselves expressive: rather than analyzing any single, given work, they recombine expressive patterns drawn from everything the model has ever been trained on. A model that can generate new text “in the style of” an author straddles both sides of the legal line: analysis on one side, imitation on the other.
This uncertainty leaves companies and creators alike navigating without clear markers for what is in and out of bounds. For developers, the practical question is not just whether fair use technically applies, but whether it is a hill worth dying on. Some firms are already moving toward hybrid approaches that combine licensed data, public-domain material, and opt-in contributor programs. Others are experimenting with dataset disclosure tools that allow rightsholders to see whether their works were included in training. Still others are exploring technical measures to reduce the risk of verbatim reproduction, turning “fair use” from a purely legal defense into a design principle.
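As a rough illustration of what a disclosure tool might look like, the sketch below imagines a developer publishing content fingerprints of its training documents so a rightsholder can test for inclusion without the developer releasing the corpus itself. The hashing and normalization choices here are assumptions for illustration, not a description of any vendor’s actual system.

```python
# Hypothetical dataset disclosure tool: the developer publishes hashes of
# training documents; a rightsholder checks a work against that manifest
# without ever seeing the raw corpus. All documents below are stand-ins.

import hashlib

def fingerprint(text: str) -> str:
    """Hash a document after collapsing whitespace and case differences."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Developer side: a published manifest of fingerprints, one per document.
training_docs = ["First sample essay.", "Second sample essay."]  # stand-ins
manifest = {fingerprint(doc) for doc in training_docs}

# Rightsholder side: check whether a given work appears in the training set.
my_work = "First   SAMPLE essay."  # formatting differences normalize away
print(fingerprint(my_work) in manifest)  # True: the work was included
```

Even a scheme this simple shifts the burden of proof: inclusion becomes something a rightsholder can verify rather than something discovered only through litigation or a lucky regurgitation.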
For authors, musicians, and publishers, the challenge is different but equally complex. The prospect of licensing data to AI systems holds promise, but it also threatens to redefine creative labor. If every word and brushstroke can be reduced to a vector in a training set, then every creator becomes a silent collaborator in a machine’s imagination, compensated, if at all, only through collective bargaining or litigation. Some see that as exploitation; others as artistic evolution.
The eventual resolution will almost certainly blend those two views. Courts and regulators will not outlaw training wholesale, but they will likely require clearer provenance, transparent data governance, or even micro-licensing mechanisms that distribute value back to creators. Companies that anticipate those shifts and treat dataset design as a matter of compliance, ethics, and brand trust will be better positioned than those that rely solely on legal optimism.
Fair use has always been a balancing act between innovation and protection, between borrowing and theft. What’s new is the scale of the borrowing and the automation of the use. As artificial intelligence continues to learn from the human record, the law must decide whether that record remains property or becomes raw material for the next era of creativity. Either way, the question of “fairness” in fair use has never mattered more.