

Language models actually do learn things, in the sense that the information encoded in the trained model isn't usually* taken directly from the training data; instead, it's information that describes the training data, but is new. That's why they can generate text that has never appeared in the data.
* the bigger models seem to remember some of the data and can reproduce it verbatim, but that's not really the goal.
They don't redistribute. They learn information about the material they've been trained on, not the material itself, and can use it to generate material they've never seen.
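
A toy analogy of that idea (a character bigram model, which is of course not how LLMs work internally, but shows the same principle at miniature scale): what gets stored is statistics *about* the training text, not the text itself, yet sampling from those statistics can produce strings that never appeared in the data. The corpus and names here are made up for illustration.

```python
import random
from collections import defaultdict

corpus = ["learning", "language", "generate", "material"]

# "Training": count which character follows which (with start/end markers).
# Nothing here stores the original words as retrievable units.
transitions = defaultdict(list)
for word in corpus:
    chars = ["^"] + list(word) + ["$"]
    for a, b in zip(chars, chars[1:]):
        transitions[a].append(b)

# "Generation": sample from the learned statistics.
def sample(max_len=12):
    out, ch = [], "^"
    while len(out) < max_len:
        ch = random.choice(transitions[ch])
        if ch == "$":
            break
        out.append(ch)
    return "".join(out)

novel = [w for w in (sample() for _ in range(10)) if w not in corpus]
print(novel)  # e.g. ['lange', 'materal', ...] -- strings not in the corpus
```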