Basically, model collapse happens when the training data no longer matches real-world data.
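As a toy picture of that feedback loop (my own sketch, not anything from the thread): fit a simple model to data, generate from it, lose a bit of the tails each time the way generative models tend to, and retrain on the output. The learned distribution narrows every generation:

```python
# Toy sketch of model collapse (illustration only): fit a Gaussian "model" to data,
# sample from it, drop the most extreme 10% of samples (standing in for the way
# generative models under-represent rare/tail events), and refit on what's left.
# The learned spread shrinks generation after generation.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=5000)   # the "real world" data

for gen in range(10):
    mu, sigma = data.mean(), data.std()              # "train" the model
    print(f"generation {gen}: sigma = {sigma:.3f}")
    samples = rng.normal(mu, sigma, size=5000)       # model output
    keep = np.abs(samples - mu) < 1.645 * sigma      # tails get lost in generation
    data = samples[keep]                             # next model trains on this
```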
I’m more concerned about LLMs collapsing the whole idea of “real world”.
I’m not a machine learning expert but I do get the basic concept of training a model and then evaluating its output against real data. But the whole thing rests on the idea that you have a model trained with relatively small samples of the real world and a big, clearly distinct “real world” to check the model’s performance.
If LLMs have already ingested basically all the information in the “real world”, and their output is so pervasive that you can’t easily tell what’s true and what’s AI-generated slop, then “how do we train our models now” is not my main concern.
As an example, take the judges who found made-up cases because lawyers used an LLM. What happens if those made-up cases get referenced in several other places, including some legal textbooks used in law schools? Don’t they become part of the “real world”?
No, because there’s still no case.
Law textbooks that taught an imaginary case would just get a lot of lawyers in trouble, because eventually someone will want to read the whole case and will try to pull the actual record, not just a reference. Those cases aren’t susceptible to this because they’re essentially a historical record. It’s like the difference between a scan of the Declaration of Independence and a high school history book describing it. Only one of those things could be bullshitted by an LLM.
The same applies to law schools. People reference back to cases all the time, and there’s an opposing lawyer, after all, who’d love a slam-dunk win of “your honor, my opponent is actually full of shit and making everything up”. Any lawyer trained on imaginary material as if it were reality will just fail repeatedly.
LLMs can deceive lawyers who don’t verify their work. Lawyers are in fact required to verify their work, and the ones that have been caught using LLMs are quite literally not doing their job. If that weren’t the case, lawyers would just make up cases themselves; they don’t need an LLM for that. But it doesn’t happen, because it doesn’t work.
It happens all the time though. Made-up and false facts get accepted as truth with no verification.
So hard disagree.
The difference is, if this were to happen and it was later found that a made-up case crucial to the defense had been cited, that’s a mistrial. Maybe even dismissed with prejudice.
Courts are bullshit sometimes, it’s true, but it would take deliberate judge/lawyer collusion for this to occur, or the incompetence of the judge and the opposing lawyer.
Is that possible? Sure. But the question was “will fictional LLM case law enter the general knowledge?” and my answer is “in a functioning court, no.”
If the judge and a lawyer are colluding, or if the judge and the opposing lawyer are both that grossly incompetent, then we are far beyond an improper LLM citation.
TL;DR As a general rule, you have to prove facts in court. When that stops being true, liars win, no AI needed.
To put a finer point on it, I’m not arguing that BS should be used in court. That’s just a bad idea. I’m saying that BS has been accepted as fact before: look at the way history is taught in most countries. It’s very biased towards the country’s own ruling class and usually involves lies of some sort.
My first thought was that it would make a cool sci-fi story where future generations lose all documented history other than AI-generated slop, and factions go to war over whose history is correct, or over disagreements that were themselves made up.
And then I remembered all the real life wars of religion…
Maybe, but even if that’s not an issue, there is a bigger one:
Law of diminishing returns.
So to double performance, it takes much more than double the data (rough sketch below).
Right now LLMs aren’t profitable even though they are more efficient compared to using more data.
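A back-of-the-envelope version of that point, assuming the power-law scaling behaviour reported in the scaling-law papers (loss roughly proportional to data^-α with a small exponent; the α used here is illustrative, not a measured value):

```python
# Rough sketch of diminishing returns under a power-law scaling assumption:
# loss ~ dataset_size ** -alpha, with alpha well below 1 (value here is illustrative).
alpha = 0.3

def loss(n_tokens: float) -> float:
    return n_tokens ** -alpha

n = 1e9                      # pretend current dataset size, in tokens
target = loss(n) / 2         # we want half the loss ("double the performance")
n_needed = target ** (-1 / alpha)

print(f"data multiplier to halve the loss: {n_needed / n:.0f}x")
# With alpha = 0.3 that works out to 2 ** (1/0.3), roughly 10x the data, not 2x.
```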
All this AI craze has taught me is that the human brain is super advanced, given its performance on just the energy of a light bulb.
If it wasn’t a fledgling technology with a lot more advancements yet to be made, I’d worry about that.
All this AI craze has taught me is that the human brain is super advanced, given its performance on just the energy of a light bulb.
Seemed superficially obvious.
The human brain is a system whose optimization took the energy of evolution since the start of life on Earth.
That is, a vastly bigger amount of data.
It’s like comparing a barrel of oil to a barrel of soured milk.
No. Not necessarily, but the internet will become worse nonetheless.
You mean poorlyer
Ouroboros effect
Fingers crossed.
Yes please!
How about we don’t feed AI to itself then? Seems like that’s just a choice we could make?
Most LLMs seed (watermark) their output so their creators can recognize whether something was generated by them. I can see common standards emerging for this across LLMs, since it’s in every commercial LLM vendor’s best interest to know whether something is LLM output or not.
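For context, the kind of “seeding” described in the watermarking literature works roughly like this (a simplified sketch of a green-list scheme, not any vendor’s actual implementation): the previous token seeds a pseudo-random split of the vocabulary, generation is nudged towards the “green” half, and the detector checks whether a text contains suspiciously many green tokens. It only works if the detector knows the seeding scheme, which is exactly what the replies below push back on.

```python
# Simplified sketch of statistical text watermarking (green-list style), not any
# vendor's real scheme. The previous token seeds a pseudo-random split of the
# vocabulary into "green"/"red" halves; generation favors green tokens; detection
# counts how many tokens landed in their green list.
import hashlib
import random

VOCAB = [f"tok{i}" for i in range(1000)]

def green_list(prev_token: str) -> set[str]:
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    shuffled = VOCAB[:]
    random.Random(seed).shuffle(shuffled)
    return set(shuffled[: len(VOCAB) // 2])          # half the vocab is "green"

def generate(n_tokens: int, bias: float = 0.9) -> list[str]:
    """Stand-in for an LLM: pick a green token with probability `bias`."""
    out = ["tok0"]
    for _ in range(n_tokens):
        green = green_list(out[-1])
        pool = list(green) if random.random() < bias else list(set(VOCAB) - green)
        out.append(random.choice(pool))
    return out

def green_fraction(tokens: list[str]) -> float:
    hits = sum(tokens[i + 1] in green_list(tokens[i]) for i in range(len(tokens) - 1))
    return hits / (len(tokens) - 1)

watermarked = generate(200)
unmarked = ["tok0"] + random.choices(VOCAB, k=200)   # "human" text: no green bias
print("watermarked green fraction:", round(green_fraction(watermarked), 2))  # ~0.9
print("unmarked green fraction:", round(green_fraction(unmarked), 2))        # ~0.5
```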
Nah that means you can ask an LLM “is this real” and get a correct answer.
That defeats the point of a whole range of material.
Deepfakes, for instance. International espionage, propaganda, companies who want “real people”.
A simple is_ai flag of any kind is undesirable for those actors, but their output will still end up back in every LLM, even one that was behaving and flagging its own output.
You’d need every LLM to do this, and there are open-source models and foreign ones. And as has already been shown, you can’t rely on an LLM detecting generated content without a watermark.
The correct way to do it would be to instead organize a not-AI certification for real content. But that would severely limit training data. It could happen once quantity of data isn’t the be-all end-all for a model, but I dunno when, or if, that’ll be the case.
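To sketch what such a certification could look like in principle (my own illustration; real efforts such as C2PA-style content credentials use public-key signatures and signed provenance metadata rather than the shared key below): a trusted publisher signs a hash of the content, and the content can later be checked against that certificate.

```python
# Minimal illustration of "certify the real content" (a sketch, not an actual
# standard; a real scheme would use public-key signatures so anyone can verify).
# A trusted publisher tags a hash of the content; verification checks the tag.
import hashlib
import hmac

PUBLISHER_KEY = b"key held by the trusted publisher"   # stand-in for a real signing key

def certify(content: bytes) -> str:
    digest = hashlib.sha256(content).digest()
    return hmac.new(PUBLISHER_KEY, digest, hashlib.sha256).hexdigest()

def verify(content: bytes, certificate: str) -> bool:
    return hmac.compare_digest(certify(content), certificate)

original = b"court opinion, photo, article..."
cert = certify(original)

print(verify(original, cert))                      # True: content matches the certificate
print(verify(b"AI-generated substitute", cert))    # False: no valid certificate for this
```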
LLM watermarking is economically desirable. Why would it be more profitable to train worse LLMs on LLM outputs? I’m curious to hear any argument.
Also, what do deepfakes have to do with LLMs? That’s not related at all.
A certificate for “real” content is not feasible. It’s much easier to just prevent LLMs from training on LLM output.