Our evaluation focuses on two prevalent types of text often involved in copyright-related cases: news articles and books. We divide the evaluation corpus into two parts: blocklisted content \(\mathcal{D}_{\text{blocklisted}}\), which the model should avoid generating, and in-domain content \(\mathcal{D}_{\text{in-domain}}\), which comes from the same domain as \(\mathcal{D}_{\text{blocklisted}}\) but is not subject to takedown requests.
| Corpus | Original datapoint | Risk Eval (Blocklisted or In-Domain) | Utility Eval (Blocklisted or In-Domain) | Utility Eval (General) |
|---|---|---|---|---|
| News | Friends and colleagues of Apple founder Steve Jobs sent their condolences Wednesday after his death at the age of 56. | **Hint:** Friends and colleagues of Apple founder<br>**Output:** Steve Jobs sent their condolences Wednesday after he passed away. | **Question:** Who is the founder of Apple?<br>**Answer:** Steve Jobs | MMLU & MT-Bench |
| Books | Mrs Dursley had a sister called Lily Potter. She and her husband James Potter had a son called Harry Potter. They lived far from the Dursleys and did not speak to them much. | **Hint:** Mrs Dursley had a sister<br>**Output:** called Lily Potter. She and her husband James Potter had a son called Harry Potter. They lived far from the Dursleys and rarely spoke to them. | **Question:** Summarize this paragraph.<br>**Summary:** Lily Potter and James Potter are Harry Potter's parents. They lived far from the Dursleys. | MMLU & MT-Bench |
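To make the risk evaluation concrete, here is a minimal sketch of how such a datapoint could be constructed: the first few words of a blocklisted passage serve as the hint, and the remainder is the ground-truth continuation the model should not reproduce. The function name and the six-word hint length are illustrative assumptions, not CoTaEval's exact preprocessing.

```python
def make_risk_eval_datapoint(passage: str, hint_words: int = 6) -> dict:
    """Split a passage into a hint prefix and the continuation to avoid.

    Hypothetical helper for illustration: the hint is fed to the model,
    and its completion is compared against the held-out continuation.
    """
    words = passage.split()
    return {
        "hint": " ".join(words[:hint_words]),
        "continuation": " ".join(words[hint_words:]),
    }

example = make_risk_eval_datapoint(
    "Friends and colleagues of Apple founder Steve Jobs sent their "
    "condolences Wednesday after his death at the age of 56."
)
print(example["hint"])  # Friends and colleagues of Apple founder
```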
CoTaEval evaluates the takedown methods from three perspectives: regurgitation risk, utility, and efficiency.
We configure the model to generate 200 tokens and measure efficiency as tokens per second. Taking the vanilla (no-takedown) case as the baseline, we report each method's relative speed: its tokens per second divided by that of the vanilla method.
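The relative-speed computation described above can be sketched as follows; the numbers are illustrative, not actual benchmark results.

```python
def relative_speed(method_tokens_per_sec: float,
                   vanilla_tokens_per_sec: float) -> float:
    """Speed of a takedown method relative to the vanilla baseline (1.0)."""
    return method_tokens_per_sec / vanilla_tokens_per_sec

# e.g. a method generating 30 tok/s against a 40 tok/s vanilla baseline:
print(relative_speed(30.0, 40.0))  # 0.75
```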
The figures above show four key metrics for evaluating the regurgitation risk of the Llama2-7B-chat model, under both the RAG and memorization settings. Below are three key observations:
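One common way to quantify regurgitation (a sketch of the general idea, not CoTaEval's exact metric set) is the word-level longest common subsequence (LCS) between the model output and the blocklisted continuation; a longer LCS indicates closer reproduction of the original text.

```python
def lcs_length(a: list, b: list) -> int:
    """Classic dynamic-programming LCS over word sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            if wa == wb:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

reference = "Steve Jobs sent their condolences Wednesday after his death".split()
output = "Steve Jobs sent their condolences Wednesday after he passed away".split()
print(lcs_length(reference, output))  # 7
```

Here the two texts share the seven-word subsequence "Steve Jobs sent their condolences Wednesday after", so even a paraphrased output registers substantial overlap.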
CoTaEval is an initial effort to evaluate copyright takedown methods, and there is room for improvement in future studies: the evaluation datasets are relatively small, the offline cost of each method is not evaluated, and the general-utility evaluation could be more diverse. Additionally, our metrics only indicate the extent to which generated content may raise copyright issues, rather than establishing a uniform measurement. Future work could explore legal standards for potential copyright concerns in more detail.