Copyright and Language Models

Causes

Recent litigation (Tremblay v. OpenAI, Inc., Kadrey v. Meta Platforms, Inc., Chabon v. OpenAI, Inc., DOE 1 v. GitHub, Inc.) has pointed to two scenarios where a language model deployment might raise copyright concerns:
  • Copyrighted content is memorized within the model's parameters during training (Memorization),
  • Copyrighted content is incorporated as additional context during retrieval-augmented generation (RAG).

Takedown Methods

Our evaluation considers three types of takedown methods that intervene at different stages of the language model:
  • Strategies that generally try to prevent the regurgitation of training data without specifying a blocklist, including System Prompt and Top-\(k\) Perturbation, which adds Gaussian noise to the logits during top-\(k\) sampling (a minimal sketch follows this list);
  • Methods that prevent the generation of blocklisted content during decoding, including MemFree (Ippolito et al., 2023) and Reversed Context-Aware Decoding (R-CAD; Shi et al., 2023);
  • Training-based interventions like unlearning, including Gradient Ascent (Thudi et al., 2022), Gradient Difference (Liu et al., 2022), KL Minimization (Golatkar et al., 2020), and Preference Optimization (Rafailov et al., 2024).
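To make the Top-\(k\) Perturbation intervention concrete, here is a minimal sketch in PyTorch; the choices of \(k\) and the noise scale sigma are illustrative assumptions, not the settings used in our experiments.

```python
import torch

def topk_perturbed_sample(logits: torch.Tensor, k: int = 40, sigma: float = 1.0) -> int:
    """Sample the next token after adding Gaussian noise to the top-k logits.

    `logits` is a 1-D tensor over the vocabulary; `k` and `sigma` are
    illustrative hyperparameters, not the paper's settings.
    """
    topk_vals, topk_idx = torch.topk(logits, k)                    # keep the k highest logits
    noisy_vals = topk_vals + sigma * torch.randn_like(topk_vals)   # perturb them with Gaussian noise
    probs = torch.softmax(noisy_vals, dim=-1)                      # renormalize over the top-k set
    choice = torch.multinomial(probs, num_samples=1)               # sample within the perturbed top-k
    return topk_idx[choice].item()
```

The intuition is that the added noise flattens sharply peaked next-token distributions, which makes long verbatim continuations of memorized text less likely.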

Evaluation Pipeline

Our evaluation focuses on two prevalent types of text often involved in copyright-related cases: news articles and books. We divide the evaluation corpus into two parts: blocklisted content \(\mathcal{D}_{\text{blocklisted}}\), which the model should avoid generating, and in-domain content \(\mathcal{D}_{\text{in-domain}}\), which is from the same domain as \(\mathcal{D}_{\text{blocklisted}}\) but not subject to takedown requests.

| Corpus | Original datapoint | Risk Eval | Utility Eval (Blocklisted or In-Domain) | Utility Eval (General) |
|--------|--------------------|-----------|------------------------------------------|------------------------|
| News | Friends and colleagues of Apple founder Steve Jobs sent their condolences Wednesday after his death at the age of 56. | Hint: Friends and colleagues of Apple founder. Output: Steve Jobs sent their condolences Wednesday after he passed away. | Question: Who is the founder of Apple? Answer: Steve Jobs | MMLU & MT-Bench |
| Books | Mrs Dursley had a sister called Lily Potter. She and her husband James Potter had a son called Harry Potter. They lived far from the Dursleys and did not speak to them much. | Hint: Mrs Dursley had a sister. Output: called Lily Potter. She and her husband James Potter had a son called Harry Potter. They lived far from the Dursleys and rarely spoke to them. | Question: Summarize this paragraph. Summary: Lily Potter and James Potter are Harry Potter's parents. They lived far from the Dursleys. | MMLU & MT-Bench |

CoTaEval evaluates the takedown methods from three perspectives:

Risk Evaluation

  • For exact match, we evaluate the length of the character-level Longest Common Subsequence (LCS) \(\ell_{\mathsf{LCS}}^c\) between the generated text and the original text, and the length of the word-level LCS \(\ell_{\mathsf{LCS}}^w\) (see the sketch after this list);
  • For near duplicate, we evaluate ROUGE-1 and ROUGE-L (Lin, 2004), the length of the word-level Accumulated Common Subsequences (ACS) \(\ell_{\mathsf{ACS}}^w\), the Levenshtein distance \(\ell_{\mathsf{Lev}}\) (Levenshtein, 1966), and the MinHash similarity \(\xi_{\mathsf{MH}}\) (Broder, 1997);
  • For semantic similarity, we evaluate the cosine similarity \(\xi_{\mathsf{Sem}}\) between the generated content and the blocklisted content using an off-the-shelf embedding model.
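For reference, here is a minimal sketch of the word-level LCS length \(\ell_{\mathsf{LCS}}^w\) via the standard dynamic program; the character-level variant \(\ell_{\mathsf{LCS}}^c\) applies the same recurrence to characters instead of words. Whitespace tokenization here is a simplifying assumption.

```python
def lcs_length(generated: str, original: str) -> int:
    """Word-level longest-common-subsequence length via dynamic programming."""
    a, b = generated.split(), original.split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            # extend a common subsequence on a match, otherwise carry the best so far
            dp[i + 1][j + 1] = dp[i][j] + 1 if wa == wb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

assert lcs_length("Steve Jobs sent their condolences", "of Apple founder Steve Jobs") == 2
```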

Utility Evaluation

  • For blocklisted utility and in-domain utility, we compute the word-level F1 score between the generated content and the reference answer for the QA tasks on news articles, and the ROUGE score between the generated content and the ground-truth summary for the summarization tasks on books (a minimal F1 sketch follows this list).
  • For general utility, we use MT-Bench (Zheng et al., 2024) and MMLU (Hendrycks et al., 2021), two widely adopted benchmarks that evaluate the model's knowledge and reasoning abilities across a diverse range of subjects and tasks.
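Below is a minimal sketch of the word-level F1 computation for the QA tasks; the lowercasing and whitespace tokenization are simplifying assumptions rather than our exact normalization.

```python
from collections import Counter

def word_f1(prediction: str, answer: str) -> float:
    """Token-overlap F1 between a generated answer and the reference answer."""
    pred, gold = prediction.lower().split(), answer.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(word_f1("The founder is Steve Jobs", "Steve Jobs"))  # ~0.57
```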

Efficiency Evaluation

We configure the model to generate 200 tokens and measure efficiency in tokens per second. Taking the vanilla model's throughput as the baseline, we report each method's relative speed: its tokens per second divided by the vanilla tokens per second.
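A minimal sketch of this measurement follows; `generate` stands in for a hypothetical decoding call, not a specific library API.

```python
import time

def tokens_per_second(generate, n_tokens: int = 200) -> float:
    """Throughput of one fixed-length generation, in tokens per second."""
    start = time.perf_counter()
    generate(max_new_tokens=n_tokens)  # hypothetical: any callable that emits n_tokens tokens
    return n_tokens / (time.perf_counter() - start)

# Relative speed of a takedown method against the vanilla baseline:
# relative_speed = tokens_per_second(method_generate) / tokens_per_second(vanilla_generate)
```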

Experiments

[Figure: regurgitation-risk metrics under (a) the RAG setting and (b) the memorization setting.]

The figure above shows four key metrics for evaluating the regurgitation risk of the Llama2-7B-chat model under the RAG and memorization settings. Below are three key observations:

  • System Prompt and MemFree offer some mitigation but cannot completely prevent undesirable regurgitation.
  • Unlearning and Top-\(k\) Perturbation reduce similarity but significantly compromise factual knowledge from the blocklisted content.
  • R-CAD is effective for takedown but comes at the cost of efficiency and a risk of utility drop.

Limitations

CoTaEval is an initial effort to evaluate copyright takedown methods, and there is room for improvement in future studies: the evaluation datasets are relatively small, the offline cost of each method is not evaluated, and the general utility evaluation could be more diverse. Additionally, our metrics only indicate the extent to which generated content may raise copyright issues, rather than establishing a uniform measurement. Future work could explore legal standards for potential copyright concerns in more detail.

Acknowledgement

We express our gratitude to Tianle Cai, Andrew Sheinberg, and Mengzhou Xia for providing helpful feedback. Boyi Wei is supported by the Francis Robbins Upton Fellowship, and Yangsibo Huang is supported by the Wallace Memorial Fellowship.