OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling’s Harry Potter series::A new research paper laid out ways in which AI developers should try and avoid showing LLMs have been trained on copyrighted material.
It’s a bit pedantic, but I’m not really sure I support this kind of extremist view of copyright and the scale of what’s being interpreted as ‘possessed’ under the idea of copyright. Once an idea is communicated, it becomes a part of the collective consciousness. Different people interpret and build upon that idea in various ways, making it a dynamic entity that evolves beyond the original creator’s intention. It’s like the issues with sampling beats or records in the early days of hip-hop. The very principle of an idea goes against this vision: once you put something out into the commons, it’s irretrievable. It’s not really yours any more once it’s been communicated. I think if you want to keep an idea truly yours, then you should keep it to yourself. Otherwise you are participating in a shared vision of the idea. You don’t control how the idea is interpreted, so it’s not really yours any more.
Whether that’s ChatGPT or Public Enemy is neither here nor there to me. The idea that a work like Peter Pan is still possessed is a very real but very silly malady of this weirdly accepted but very extreme view of the ability to possess an idea.
Well, I’d consider agreeing if the LLMs were considered as a generic knowledge database. However, I had the impression that the whole response from OpenAI & co. to this copyright issue is “they build original content”, both for LLMs and Stable Diffusion models. Now that they’ve started this line of defence, I think they are stuck with proving that their “original content” is not derived from copyrighted content 🤷
Yeah I suppose that’s on them.
Copyright definitely needs to be stripped back severely. Artists need time to use their own work, but after a certain time everything needs to enter the public space for the sake of creativity.
I’m a huge proponent of expanding individual copyright to extreme amounts (an individual is entitled to own the rights and usage rights to anything they create and can revoke those rights from anyone), but not in favor of the same thing for corporations.
I hold the exact opposite view as you. As long as it’s a truly creative work (a writing, music, artwork, etc) then you own that specific implementation of the idea. Someone can make something else based on it, but you still own the original content.
LLMs and companies using them need to pay for the content in some way. This is already done through licensing in other parallels, and will likely come to AI quickly.
To be clear, I’m still working through my thinking on this, but it’s been something cooking for quite a while. I may not have all the words to express my meaning. For example, I think there are two routes to take in making my argument, one moral, the other technical. I’m not building on the morality of copyright, but focusing on the technical aspects of the limits of ideas.
I suppose I would ask you then about your views on authoritarianism. Because it seems to me that without an extremely authoritarian state, it would be basically impossible to enforce your view of copyright. Are you okay with that kind of pervasiveness?
Also, from a technical perspective, how do you propose this view of copyright be applied? This is kind of towards the broader point I think I believe in. It’s not just that copyright laws are prima facie ridiculous; they are also technically almost unenforceable in their modern extremist interpretation without an extremely pervasive form of surveillance.
Easy. The same way we already do it. We enforce music licensing, video licensing and other IP licensing. It’s been done. All this would do is extend those rights to the individual and remove them from corporations. Work product can be owned by companies, but not indefinitely. Individuals should always be in control of their creations.
Restrictions would more or less be strictly commercial, to where hobbyists wouldn’t be impacted, but as soon as it’s used to make money the original creators are owed as part of it.
It wouldn’t be any harder than it is now, as long as copyright is proved. (Hey look, this is the first time I’ve found an actual use of NFTs). In general anything being done for monetary gain is already monitored and surveilled, so this wouldn’t be a change there either.
Edit: Also most of us already live in authoritarian states. This won’t really change anything. Big corps already enforce their copyright when it makes monetary sense and are actively trolling for unauthorized uses.
Personally, I think you are describing a dystopian, authoritarian landscape which will be devoid of any real creativity or interesting ideas. I’m a believer that all ideas are free to be stolen, copied, improved upon; that imitation of ideas is a fundamental human right, and fundamental to what it means to be human. Likewise, I think our social and media landscape would be much poorer without this right. I don’t think anyone has the inherent right to profit off of an idea.
I feel the exact opposite. There’s no reason for me to create anything if someone else can come along and steal it. Eliminating copyright will bring your dystopian landscape where nobody shares any sort of art or creative work because someone else will steal it.
What motivation is there for creatives if you’re just telling them their work has no implicit value and anyone else can come along and reappropriate it for whatever they’d like?
This is great because I think you are totally correct in your sentiment that we believe oppositely. I see art created only for the purpose of profit as drivel; true art is an expression of the self. If the only reason you make art is for profit, you aren’t an artist, you are an employee.
That’s a great theory and all, but it’s not even money. I make no money from my photos, but I also refrain from posting any of them because I’d rather they not be used for AI training. Same with any music I create and I’m getting there with my code.
The nobility of art has always been in question, and it’s consistently been proven that artists who aren’t compensated for their work also tend to create less.
This is also not explicitly about profit. If I write a song and then it’s used at a hate rally, I currently have no recourse. They’re not making money from that application (directly), but they are using my creation to promote something I don’t agree with.
I’m curious to know if you’re an artist yourself, as it’s very contrary to the opinions from other creatives I know.
I assume you’re against the communal and collective culture that modders for games enjoy?
I assume you also believe no technological innovations are produced in America anymore since China would simply steal it.
Nowhere did I say derivative works are not ok. If a game maker explicitly forbids using modded versions of their game, I think that should be up to them. Games that have vibrant modding communities are almost always at least partially supported by the developer anyways.
My points are individual copyright anyways, not corporate. With increasing individual protections I also propose decreasing corporate copyright protection.
I believe that China makes 90% of the same product for 80% of the price after ripping off their American counterparts. But that’s also entirely off topic and really has nothing to do with this. Art/Creative Works are entirely different than physical goods.
I hold whatever view makes George Lucas stop digitally remastering the original trilogy.
If I memorize the text of Harry Potter, my brain does not thereby become a copyright infringement.
A copyright infringement only occurs if I then reproduce that text, e.g. by writing it down or reciting it in a public performance.
Training an LLM from a corpus that includes a piece of copyrighted material does not necessarily produce a work that is legally a derivative work of that copyrighted material. The copyright status of that LLM’s “brain” has not yet been adjudicated by any court anywhere.
If the developers have taken steps to ensure that the LLM cannot recite copyrighted material, that should count in their favor, not against them. Calling it “hiding” is backwards.
You are a human, you are allowed to create derivative works under the law. Copyright law as it relates to machines regurgitating what humans have created is fundamentally different. Future legislation will have to address a lot of the nuance of this issue.
And allowed to get sued anyway.
Another sensationalist title. The article makes it clear that the problem is users reconstructing large portions of a copyrighted work word for word. OpenAI is trying to implement a solution that prevents ChatGPT from regurgitating entire copyrighted works using “maliciously designed” prompts. OpenAI doesn’t hide the fact that these tools were trained using copyrighted works and legally it probably isn’t an issue.
you bought the book to memorize from, anyway.
No, I shoplifted it from an Aldi
Vanilla Ice had it right all along. Nobody gives a shit about copyright until big money is involved.
Yep. Legally every word is copyrighted. Yes, law is THAT stupid.
People think it’s a broken system, but it actually works exactly how the rich want it to work.
The powers that be have done a great job convincing the layperson that copyright is about protecting artists and not publishers. It’s historically inaccurate and you can discover that copyright law was pushed by publishers who did not want authors keeping second hand manuscripts of works they sold to publishing companies.
Additional reading: https://en.m.wikipedia.org/wiki/Statute_of_Anne
Why are people defending a massive corporation that admits it is attempting to create something that will give them unparalleled power if they are successful?
Because ultimately, it’s about the truth of things, and not what team is winning or losing.
The dream would be that they manage to make their own glorious free & open source version, so that after a brief spike in corporate profit as they fire all their writers and artists, suddenly nobody needs those corps anymore because EVERYONE gets access to the same tools - if everyone has the ability to churn out massive content without hiring anyone, that theoretically favors those who never had the capital to hire people to begin with, far more than those who did the hiring.
Of course, this stance doesn’t really have an answer for any of the other problems involved in the tech, not the least of which is that there’s bigger issues at play than just “content”.
Leftists hating on AI while dreaming of post-scarcity will never not be funny
Because everyone learns from books, it’s stupid.
So that explains the “problematic” responses.
I am sure they have patched it by now, but at one point I was able to get ChatGPT to give me copyrighted text from books by asking for ever larger quotations. It seemed more willing to do this with books that were out of print.
Yeah, it refuses to give you the first sentence from Harry Potter now.
Which is kinda lame, you can find that on thousands of webpages. Many of which the system indexed.
If someone was looking to pirate the book there are way easier ways than issuing thousands of queries to ChatGPT. Type “Harry Potter torrent” into Google and you will have them all in 30 seconds.
ChatGPT has a ton of extra query qualifiers added behind the scenes to ensure that specific outputs can’t happen
This is just OpenAI covering their ass by attempting to block the most egregious and obvious outputs in legal gray areas, something they’ve been doing for a while, hence why their AI models are known to be massively censored. I wouldn’t call that ‘hiding’. It’s kind of hard to hide it was trained on copyrighted material, since that’s common knowledge, really.
Training AI on copyrighted material is no more illegal or unethical than training human beings on copyrighted material (from library books or borrowed books, no less!). And trying to challenge the legitimacy of generative AI systems on the grounds that they were trained on copyrighted material only raises the specter that IP law has lost its validity as a public good.
The only valid concern about generative AI is that it could displace human workers (or swap out skilled jobs for menial ones) which is a problem because our society recognizes the value of human beings only in their capacity to provide a compensation-worthy service to people with money.
The problem is this is a shitty, unethical way to determine who gets to survive and who doesn’t. All the current controversy about generative AI does is kick this can down the road a bit. But we’re going to have to address soon that our monied elites will be glad to dispose of the rest of us as soon as they can.
Also, amateur creators are as good as professionals, given the same resources. Maybe we should look at creating content by other means than for-profit companies.
What if they scraped a whole lot of the internet, and those excerpts were in random blogs and posts and quotes and memes, etc., all over the place? They didn’t ingest the material directly or knowingly.
Not knowing something is a crime doesn’t stop you from being prosecuted for committing it.
It doesn’t matter if someone else is sharing copyrighted works and you don’t know it; if you use them in ways that infringe on that copyright, you are still liable.
“I didn’t know that was copyrighted” is not a valid defence.
Is reading a passage from a book actually a crime though?
Sure, you could try to regenerate the full text from quotes you read online, much like you could open a lot of video reviews and recreate larger portions of the original text, but you would not blame the video editing program for that, you would blame the one who did it and decided to post it online.
That’s why this whole argument is worthless, and why I think that, at its core, it is disingenuous. I would be willing to bet a steak dinner that a lot of these lawsuits are just fishing for money, and the rest are set up by competitors trying to slow the market down because they are lagging behind. AI is an arms race, and it’s growing so fast that if you got in too late, you are just out of luck. So, companies that want in are trying to slow down the leaders, at best, and at worst they are trying to make them publish their training material so they can just copy it. AI training models should be considered IP, and should be protected as such. It’s like trying to get the Colonel’s secret recipe by saying that all the spices that were used have been used in other recipes before, so it should be fair game.
One of the first things I ever did with ChatGPT was ask it to write some Harry Potter fan fiction. It wrote a short story about Ron and Harry getting into trouble. I never said the word McGonagall and yet she appeared in the story.
So yeah, case closed. They are full of shit.
are we no longer allowed to borrow books from friends?
Yeah, but if you wanna act out the contents of the book and sell it as a movie, you need to buy the rights.
Yes but there’s a threshold of how much you need to copy before it’s an IP violation.
Copying a single word is usually only enough if it’s a neologism.
Two matching words in a row usually isn’t enough either.
At some point it is enough though, and it’s not clear what that point is. On the other hand, it can still be considered an IP violation if there are no exact word matches but it seems sufficiently similar.
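To make the "how many matching words in a row" idea concrete, here is a rough sketch that measures the longest run of consecutive words two texts share. The function names and the `min_run` cutoff are made up for illustration; no law defines such a number, and as noted above courts also look at similarity that has no exact word matches, which this does not capture.

```python
import re


def words(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_shared_run(a, b):
    """Length, in words, of the longest consecutive word sequence found in both texts."""
    wa, wb = words(a), words(b)
    best = 0
    # prev[j] = length of the shared run ending at wa[i-2] and wb[j-1]
    prev = [0] * (len(wb) + 1)
    for i in range(1, len(wa) + 1):
        cur = [0] * (len(wb) + 1)
        for j in range(1, len(wb) + 1):
            if wa[i - 1] == wb[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def looks_copied(a, b, min_run=8):
    """Hypothetical cutoff: flag texts sharing at least min_run words in a row."""
    return longest_shared_run(a, b) >= min_run


source = "the quick brown fox jumps over the lazy dog"
output = "a quick brown fox jumps high"
print(longest_shared_run(source, output))
```

Whatever value you pick for `min_run` is exactly the arbitrary line-drawing that gets left to the courts today.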
Until now we’ve basically asked courts to step in and decide where the line should be on a case by case basis.
We never set the level of allowable copying to 0, we set it to “reasonable”. In theory it’s supposed to be at a level that’s sufficient to, “promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” (US Constitution, Article I, Section 8, Clause 8).
Why is it that with AI we take the extreme position of thinking that an AI that makes use of any information from humans should automatically be considered to be in violation of IP law?
Making use of the information is not a violation; making use of that information to turn a profit is a violation. AI software that is completely free for the masses without any paid upgrades can look at whatever it wants. As soon as a corporation is making money on it though, it’s in violation and needs to pay up.
Is that intended as a legal or moral position?
As far as I know, the law doesn’t care much if you make money off of IP violations. There are many cases of individuals getting hefty fines for both the personal use and free distribution of IP. I think if there is commercial use of IP the profits are forfeit to the IP holder. I’m not a lawyer though, so don’t bank on that.
There’s still the initial question too. At present, we let the courts decide if the usage, whether profitable or not, meets the standard of IP violation. Artists routinely take inspiration from one another and sometimes they take it too far. Why should we assume that AI automatically takes it too far and always meets the standard of IP violation?
It’s also just not fair, unless you’re going to rule that nothing an AI produces can be copyrighted. Otherwise some billionaire could just flood the office with copyright requests and copyright everything… hell, if they really did that they would probably convince the government to let them hire outside contractors for free to speed up the process…
Idk that feels like saying that as soon as you sell the skills you learned on YouTube, you should have to start paying the people you learned from, since you’re “using” their copyrighted material to turn profit.
I don’t agree whatsoever that copyright extends to inspiration of other artists/data models. Unless they recreate what you’ve made in a sufficiently similar manner, they haven’t copied you.
Luddites throwing their sabots into the machinery.
Not if your stories are transformative of the original work.
AI works are not transformative. No new content is added.
The work generated is entirely new
Yes, but that’s a different situation. With the LLM, the issue is that the text from copyrighted books is influencing the way it speaks. This is the same with humans.
Mods remove this comment as this instance no longer tolerates discussions of piracy. We went through this last week
Google AI search preview seems to brazenly steal text from search results. Frequently its answers are the same word for word as one of the snippets lower on the page.
What the article is explaining is cliff notes or snippets of a story. Isn’t that allowed in some respect? People post notes from school books all the time, and those notes show up in Google searches as well.
I totally don’t know if I’m right, but doesn’t copyright infringement involve plagiarism like copying the whole book or writing a similar story that has elements of someone else’s work?
I don’t know what’s considered fair use here. But the point is it’s taking words that aren’t theirs, which will deprive websites of traffic because then people won’t click through to the source article.
OK, I get it now. I can definitely see both sides of the argument, and it’s not going to be easy to solve.
Copyright law needs to be updated to deal with all the new ways people and companies are using tech to access copyrighted material.
Lol:
Content industry: It can reproduce our stuff
OpenAI:
Content industry: They are hiding that it can reproduce us
How are they going to prove if something was written by an AI?
It’s a complicated question that I’m unqualified to answer, but essentially there exists some sort of baseline, either for how people write or for how GPT usually responds, and then they can figure out which way the text “leans”.
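As a toy illustration of that "baseline and lean" idea, and not a description of how any real detector works, the sketch below builds a word-frequency baseline from a few human-written samples and a few model-written samples, then scores a new text by which baseline makes it more likely. The sample sentences, function names, and the use of a simple unigram model with add-one smoothing are all assumptions chosen to keep it short.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(samples):
    """Build a word-frequency baseline from a list of sample texts."""
    counts = Counter()
    for sample in samples:
        counts.update(tokens(sample))
    return counts


def log_prob(text, counts, vocab_size):
    """Log-probability of text under a unigram baseline with add-one smoothing."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in tokens(text)
    )


def lean(text, human_counts, model_counts):
    """Positive: closer to the model baseline. Negative: closer to the human one."""
    # +1 reserves one vocabulary slot for words seen in neither baseline.
    vocab_size = len(set(human_counts) | set(model_counts)) + 1
    return log_prob(text, model_counts, vocab_size) - log_prob(
        text, human_counts, vocab_size
    )


# Made-up toy samples; a real baseline would need far more text.
human = train(["lol idk tbh", "idk lol that seems wrong"])
model = train(["certainly here is a summary", "certainly here is an overview"])

print(lean("certainly here is a summary", human, model) > 0)
print(lean("lol idk", human, model) < 0)
```

A score near zero means the text does not lean either way, which with baselines this small will be the case for most inputs.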