Boffins at the University of California, Berkeley, have delved into the undisclosed depths of OpenAI’s ChatGPT and the GPT-4 large language model at its heart, and found they’re trained on text from copyrighted books.

Academics Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman describe their work in a paper titled, “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4.”

“We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web,” the researchers explain in their paper.

The team published its code and data on GitHub, and the list of books identified can be found in this Google Docs file.

GPT-4 was found to have memorized titles such as the Harry Potter children’s books, Orwell’s Nineteen Eighty-Four, The Lord of the Rings trilogy, the Hunger Games books, Hitchhiker’s Guide to the Galaxy, Fahrenheit 451, A Game of Thrones, and Dune, among others.

The authors note that science fiction and fantasy books dominate the list, which they attribute to the popularity of those titles on the web. And they point out that memorizing specific titles has downstream effects. For example, these models make more accurate predictions in answer to prompts such as, “What year was this passage published?” when they’ve memorized the book.

Another consequence of the model’s familiarity with science fiction and fantasy is that ChatGPT exhibits less knowledge of works in other genres. As the paper observes, it knows “little about works of Global Anglophone texts, works in the Black Book Interactive Project and Black Caucus American Library Association award winners.”

Via Twitter, David Bamman, one of the co-authors and an associate professor in the School of Information at UC Berkeley, summarized the paper thus: “Takeaways: open models are good; popular texts are probably not good barometers of model performance; with the bias toward sci-fi/fantasy, we should be thinking about whose narrative experiences are encoded in these models, and how that influences other behaviors.”

The researchers are not claiming that ChatGPT or the models upon which it is built contain the full text of the cited books – LLMs don’t store text verbatim. Rather, they conducted a test called a “name cloze” designed to predict a single name in a passage of 40–60 tokens (one token is equivalent to about four text characters) that has no other named entities. The idea is that passing the test indicates that the model has memorized the associated text.
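The name cloze probe can be sketched roughly as follows. This is an illustrative reconstruction, not the Berkeley authors' actual code: the helper names, the prompt wording, and the example passage are all assumptions, and the model call itself is left out.

```python
# Rough sketch of a name-cloze probe. The helper names, prompt wording,
# and sample passage are illustrative assumptions, not the UC Berkeley
# authors' code; the actual LLM completion call is omitted.

def mask_name(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Replace every occurrence of the single target name with a mask."""
    return passage.replace(name, mask)

def build_prompt(masked_passage: str) -> str:
    """Wrap the masked passage in a fill-in-the-name instruction."""
    return (
        "Fill in the [MASK] in the passage below with the proper name "
        "that appears in the original text. Reply with the name only.\n\n"
        + masked_passage
    )

def is_correct(model_guess: str, true_name: str) -> bool:
    """A prediction counts as a hit only on an exact name match."""
    return model_guess.strip() == true_name

# Hypothetical passage containing exactly one named entity.
passage = "Stay gold, Ponyboy. Stay gold."
true_name = "Ponyboy"

prompt = build_prompt(mask_name(passage, true_name))
```

The model's completion for `prompt` would then be scored with `is_correct`; a high hit rate across many such passages from a book is taken as evidence the model has memorized that text.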

“The data behind ChatGPT and GPT-4 is fundamentally unknowable outside of OpenAI,” the authors explain in their paper. “At no point do we access, or attempt to access, the true training data behind these models, or any underlying components of the systems. Our work carries out probabilistic inference to measure the familiarity of these models with a set of books, but the question of whether they truly exist within the training data of these models is not answerable.”

To make such questions answerable, the authors advocate the use of public training data, so that model behavior is more transparent. They undertook the project to understand what these models have memorized, because the models behave differently when analyzing literary texts they were trained on.

I hope this work will help further advance the state of the art in responsible data curation

“Data curation is still very immature in machine learning,” Margaret Mitchell, an AI researcher and chief ethics scientist for Hugging Face, told The Register.

“‘Don’t test on your training data’ is a common adage in machine learning, but requires careful documentation of the data; yet robust documentation of data is not part of machine learning culture. I hope this work will help further advance the state of the art in responsible data curation.”

The Berkeley computer scientists focused less on the copyright implications of memorizing texts, and more on the black box nature of these models – OpenAI does not disclose the data used to train them – and how that affects the validity of text analysis.

But the copyright implications may not be avoidable – particularly if text-generating applications built on these models produce passages that are substantially similar or identical to copyrighted texts they’ve ingested.

Land of the free, home of the lawsuit

Tyler Ochoa, a law professor at Santa Clara University in California, told The Register he fully expects to see lawsuits against the makers of large language models that generate text, including OpenAI, Google, and others.

Ochoa said the copyright issues with AI text generation are exactly the same as the issues with AI image generation. First: is copying large amounts of text or images for training the model fair use? The answer to that, he said, is probably yes.

Second: if the model generates output that’s too similar to the input – what the paper refers to as “memorization” – is that copyright infringement? The answer to that, he said, is almost certainly yes.

And third: if the output of an AI text generator is not a copy of an existing text, is it protected by copyright?

Lawsuits against AI text-generating models are inevitable

Under current law, said Ochoa, the answer is no – because US copyright law requires human creativity – though some countries disagree and will protect AI-generated works. However, he added, activities like selecting, arranging, and modifying AI model output make copyright protection more plausible.

“So far we’ve seen lawsuits over issues one and three,” said Ochoa. “Issue one lawsuits so far have involved AI image-generating models, but lawsuits against AI text-generating models are inevitable.

“We have not yet seen any lawsuits involving issue two. The paper [from the UC Berkeley researchers] demonstrates that such similarity is possible; and in my opinion, when that occurs, there will be lawsuits, and it will almost certainly constitute copyright infringement.”

Ochoa added, “Whether the owner of the model is liable, or the person using the model is liable, or both, depends on the extent to which the user has to prompt or encourage the model to accomplish the result.”

OpenAI did not respond to a request for comment. It doesn't even have a chatbot for that? ®