FTA:
Anthropic warned against “[t]he prospect of ruinous statutory damages—$150,000 times 5 million books”: that would mean $750 billion.
So part of their argument is actually that they stole so much that it would be impossible for them/anyone to pay restitution, therefore we should just let them off the hook.
Funny how that kind of thing only works for rich people
This version of too big to fail is too big a criminal to pay the fines.
How about we lock them up instead? All of em.
Ahh, can't wait for hedge funds and such to use this defense next.
Hold my beer.
Lawsuits are multifaceted. This statement isn't a defense or an argument for innocence, it's just what it says: an assertion that the proposed damages are unreasonably high. If the court agrees, the plaintiff can always propose a lower damage claim that the court thinks is reasonable.
The problem isn't that Anthropic gets to use that defense, it's that others don't. The fact that the world is in a place where people can be fined 5+ years of a western European average salary for making a copy of one (1) book that does not materially affect the copyright holder in any way is insane, and it is good to point that out no matter who does it.
And this is how you know that the American legal system should not be trusted.
Mind you, I am not saying this is an easy case, it's not. But the framing that piracy is wrong but ML training for profit is not wrong is clearly based on oligarch interests and demands.
This is an easy case. Using published works to train AI without paying for the right to do so is piracy. The judge making this determination is an idiot.
You’re right. When you’re doing it for commercial gain, it’s not fair use anymore. It’s really not that complicated.
If you’re using the minimum amount, in a transformative way that doesn’t compete with the original copyrighted source, then it’s still fair use even if it’s commercial. (This is not saying that’s what LLMs are doing.)
The judge making this determination is an idiot.
The judge hasn’t ruled on the piracy question yet. The only thing that the judge has ruled on is, if you legally own a copy of a book, then you can use it for a variety of purposes, including training an AI.
“But they didn’t own the books!”
Right. That’s the part that’s still going to trial.
Judges: not learning a goddamned thing about computers in 40 years.
You’re poor? Fuck you, you have to pay to breathe.
Millionaire? Whatever you want daddy uwu
That’s kind of how I read it too.
But as a side effect it means you’re still allowed to photograph your own books at home as a private citizen if you own them.
Prepare to never legally own another piece of media in your life. 😄
Gist:
What’s new: The Northern District of California has granted a summary judgment for Anthropic that the training use of the copyrighted books and the print-to-digital format change were both “fair use” (full order below). However, the court also found that the pirated library copies that Anthropic collected could not be deemed as training copies, and therefore, the use of this material was not “fair”. The court also announced that it will have a trial on the pirated copies and any resulting damages, adding:
“That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages.”
So I can’t use any of these works because it’s plagiarism but AI can?
My interpretation was that AI companies can train on material they are licensed to use, but the courts have deemed that Anthropic pirated this material as they were not licensed to use it.
In other words, if Anthropic bought the physical or digital books, it would be fine so long as their AI couldn’t spit it out verbatim, but they didn’t even do that, i.e. the AI crawler pirated the book.
Does buying the book give you license to digitise it?
Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?
Definitions of “Ownership” can be very different.
It seems like a lot of people misunderstand copyright so let’s be clear: the answer is yes. You can absolutely digitize your books. You can rip your movies and store them on a home server and run them through compression algorithms.
Copyright exists to prevent others from redistributing your work, so as long as you’re doing all of that for personal use, the copyright owner has no say over what you do with it.
You even have some degree of latitude to create and distribute transformative works with a violation only occurring when you distribute something pretty damn close to a copy of the original. Some perfectly legal examples: create a word cloud of a book, analyze the tone of news article to help you trade stocks, produce an image containing the most prominent color in every frame of a movie, or create a search index of the words found on all websites on the internet.
You can absolutely do the same kinds of things an AI does with a work as a human.
You can digitize the books you own. You do not need a license for that. And of course you can put that digital copy into a database, as databases are explicit exceptions in copyright law. If you want to go to the extreme: delete the first copy, then you have it only in the database. However, AIs/LLMs are not based on databases but on neural networks. The original data gets lost when “learned”.
Does buying the book give you license to digitise it?
Does owning a digital copy of the book give you license to convert it into another format and copy it into a database?
Yes. That’s what the court ruled here. If you legally obtain a printed copy of a book you are free to digitize it or archive it for yourself. And you’re allowed to keep that digital copy, analyze and index it and search it, in your personal library.
Anthropic’s practice of buying physical books, removing the bindings, scanning the pages, and digitizing the content while destroying the physical book was found to be legal, so long as Anthropic didn’t distribute that library outside of its own company.
Why would it be plagiarism if you use the knowledge you gain from a book?
Formatting thing: if you start a line in a new paragraph with four spaces, it assumes that you want to display the text as a code and won’t line break.
This means that the last part of your comment is a long line that people need to scroll to see. If you remove one of the spaces, or you remove the empty line between it and the previous paragraph, it’ll look like a normal comment
With an empty line of space:
1 space - and a little bit of writing just to see how the text will wrap. I don’t really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
2 spaces - and a little bit of writing just to see how the text will wrap. I don’t really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
3 spaces - and a little bit of writing just to see how the text will wrap. I don’t really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
4 spaces - and a little bit of writing just to see how the text will wrap. I don't really have anything that I want to put here, but I need to put enough here to make it long enough to wrap around. This is likely enough.
Personally I prefer to explicitly wrap the text in backticks.
Three ` symbols will
Have the same effect
But the behavior is more clear to the author
Thanks, I had copy-pasted it from the website :)
Ok, so you can buy books or ebooks, scan them, and use them for AI training, but you can’t just download pirated books from the internet to train AI. Did I understand that correctly?
Check out my new site TheAIBay: you search for content, and an LLM that was trained on reproducing it gives it to you; a small hash check is used to validate accuracy. It is now legal.
The court’s ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. It used the entire copyrighted work to train the weights of the LLMs, but is configured not to actually copy those works out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right to sue over those facts.
But the facts before the court were that Anthropic’s LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.
Does it “generate” a 1:1 copy?
You can train an LLM to generate 1:1 copies
Yeah I have a bash one liner AI model that ingests your media and spits out a 99.9999999% accurate replica through the power of changing the filename.
cp
Out performs the latest and greatest AI models
mv
will save you some disk space.
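For anyone who wants to run the joke “model” above, here’s a minimal sketch (the sample file and its contents are invented for illustration):

```shell
# create a stand-in "copyrighted work" (hypothetical sample file)
echo "Call me Ishmael." > original.txt

# "inference": emit a byte-identical "generation" by changing the filename
cp original.txt replica.txt

# validate the replica against the original, hash-check style
cmp -s original.txt replica.txt && echo "replica verified"
# prints: replica verified
```

`mv` achieves the same “accuracy” while halving the storage footprint.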
But, corporations are allowed to buy books normally and use them in training.
I am training my model on these 100,000 movies, your honor.
This ruling stated that corporations are not allowed to pirate books to use them in training. Please read the headlines more carefully, and read the article.
thank you Captain Funsucker!
It took me a few days to get the time to read the actual court ruling but here’s the basics of what it ruled (and what it didn’t rule on):
- It’s legal to scan physical books you already own and keep a digital library of those scanned books, even if the copyright holder didn’t give permission. And even if you bought the books used, for very cheap, in bulk.
- It’s legal to keep all the book data in an internal database for use within the company, as a central library of works accessible only within the company.
- It’s legal to prepare those digital copies for potential use as training material for LLMs, including recognizing the text, performing cleanup on scanning/recognition errors, categorizing and cataloguing them to make editorial decisions on which works to include in which training sets, tokenizing them for the actual LLM technology, etc. This remains legal even for the copies that are excluded from training for whatever reason, as the entire bulk process may involve text that ends up not being used, but the process itself is fair use.
- It’s legal to use that book text to create large language models that power services that are commercially sold to the public, as long as there are safeguards that prevent the LLMs from publishing large portions of a single copyrighted work without the copyright holder’s permission.
- It’s illegal to download unauthorized copies of copyrighted books from the internet, without the copyright holder’s permission.
Here’s what it didn’t rule on:
- Is it legal to distribute large chunks of copyrighted text through one of these LLMs, such as when a user asks a chatbot to recite an entire copyrighted work that is in its training set? (The opinion suggests that it probably isn’t legal, and relies heavily on the dividing line of how Google Books does it, by scanning and analyzing an entire copyrighted work but blocking users from retrieving more than a few snippets from those works).
- Is it legal to give anyone outside the company access to the digitized central library assembled by the company from printed copies?
- Is it legal to crawl publicly available digital data to build a library from text already digitized by someone else? (The answer may matter depending on whether there is an authorized method for obtaining that data, or whether the copyright holder refuses to license that copying).
So it’s a pretty important ruling, in my opinion. It’s a clear green light to the idea of digitizing and archiving copyrighted works without the copyright holder’s permission, as long as you first own a legal copy in the first place. And it’s a green light to using copyrighted works for training AI models, as long as you compiled that database of copyrighted works in a legal way.
Unpopular opinion but I don’t see how it could have been different.
- There’s no way the West would cede the AI lead to China, which has no desire or framework to ever accept this.
- Believe it or not, transformers are actually learning by current definitions and not regurgitating a direct copy. It’s transformative work - it’s even in the name.
- This is actually good, as it prevents a market moat for only the super-rich corporations that could afford the expensive training datasets.
This is an absolute win for everyone involved other than copyright hoarders and mega corporations.
I’d encourage everyone upset at this to read over some of the EFF posts from actual IP lawyers on this topic, like this one:
Nor is pro-monopoly regulation through copyright likely to provide any meaningful economic support for vulnerable artists and creators. Notwithstanding the highly publicized demands of musicians, authors, actors, and other creative professionals, imposing a licensing requirement is unlikely to protect the jobs or incomes of the underpaid working artists that media and entertainment behemoths have exploited for decades. Because of the imbalance in bargaining power between creators and publishing gatekeepers, trying to help creators by giving them new rights under copyright law is, as EFF Special Advisor Cory Doctorow has written, like trying to help a bullied kid by giving them more lunch money for the bully to take.
Entertainment companies’ historical practices bear out this concern. For example, in the late-2000’s to mid-2010’s, music publishers and recording companies struck multimillion-dollar direct licensing deals with music streaming companies and video sharing platforms. Google reportedly paid more than $400 million to a single music label, and Spotify gave the major record labels a combined 18 percent ownership interest in its now-$100 billion company. Yet music labels and publishers frequently fail to share these payments with artists, and artists rarely benefit from these equity arrangements. There is no reason to believe that the same companies will treat their artists more fairly once they control AI.
You’re getting douchevoted because on lemmy any AI-related comment that isn’t negative enough about AI is the Devil’s Work.
Can I not just ask the trained AI to spit out the text of the book, verbatim?
They aren’t capable of that. This is why you sometimes see people comparing AI to compression, which is a bad faith argument. Depending on the training, AI can make something that is easily recognizable as derivative, but is not identical or even “lossy” identical. But this scenario takes place in a vacuum that doesn’t represent the real world. Unfortunately, we are enslaved by Capitalism, which means the output, which is being sold for-profit, is competing with the very content it was trained upon. This is clearly a violation of basic ethical principles as it actively harms those people whose content was used for training.
Even if the AI could spit it out verbatim, all the major labs already have IP checkers on their text models that block it from doing so, since fair use for training (what was decided here) does not mean you are free to reproduce.
Like, if you want to be an artist and trace Mario in class as you learn, that’s fair use.
If once you are working as an artist someone says “draw me a sexy image of Mario in a calendar shoot” you’d be violating Nintendo’s IP rights and liable for infringement.
What a bad judge.
This is another indication of how Copyright laws are bad. The whole premise of copyright has been obsolete since the proliferation of the internet.
It’s pretty simple as I see it. You treat AI like a person. A person needs to go through legal channels to consume material, so piracy for AI training is as illegal as it would be for personal consumption. Consuming legally possessed copyrighted material for “inspiration” or “study” is also fine for a person, so it is fine for AI training as well. Commercializing derivative works that infringe on copyright is illegal for a person, so it should be illegal for an AI as well. All produced materials, even those inspired by another piece of media, are permissible if not monetized; otherwise they need to be suitably transformative. That line can be hard to draw even when AI is not involved, but that is the legal standard for people, so it should be for AI as well.

If I browse through Deviant Art and learn to draw like my favorite artists from their publicly viewable works, and make a legally distinct cartoon mouse by hand in a style that is similar to someone else’s, and then I sell prints of that work, that is legal. The same should be the case for AI.
But! Scrutiny for AI should be much stricter given the inherent lack of true transformative creativity. And any AI that has used pirated materials should be penalized either by massive fines or by wiping their training and starting over with legally licensed or purchased or otherwise public domain materials only.
But AI is not a person. It’s a very weird idea to treat it like a person.
No, it’s a tool, created and used by people. You’re not treating the tool like a person. Tools are obviously not subject to laws, can’t break laws, etc… Their usage is subject to laws. If you use a tool to intentionally, knowingly, or negligently do things that would be illegal for you to do without the tool, then that’s still illegal. Same for accepting money to give others the privilege of doing those illegal things with your tool without any attempt at moderating said things that you know are happening. You can argue that maybe the law should be more strict with AI usage than with a human if you have a good legal justification for it, but there’s really no way to justify being less strict.
Bangs gabble
Gets sack with dollar sign
“Oh good, my laundry is done”
*gavel