OpenAI just admitted it can’t identify AI-generated text. That’s bad for the internet, and it could be really bad for AI models.

In January, OpenAI launched a system for identifying AI-generated text. This month, the company scrapped it.

  • arthurpizza@lemmy.world · 22 points · 2 years ago

    We built a machine to mimic human writing. There’s going to be a point where there is no difference. We might already be there.

    • Secret@lemmy.world · 11 points · 2 years ago

      The machine built to mimic human text is trained on human text. If it can’t tell the difference between its own text and human text, it will start ingesting AI text while trying to mimic human text. That will eventually lead to errors, repetitions, and/or less human-like text.
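      A toy illustration of that feedback loop (purely a sketch, nothing from the article): fit a distribution to some data, sample from the fit, refit on those samples, and repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Human" data: a wide, well-behaved distribution.
data = rng.normal(loc=0.0, scale=1.0, size=5000)

for generation in range(10):
    # "Train" a model on the current data: here, just estimate mean and std.
    mu, sigma = data.mean(), data.std()
    print(f"gen {generation}: mean={mu:+.3f} std={sigma:.3f}")
    # The next generation trains only on the previous model's output,
    # from a smaller sample, so estimation error compounds over generations.
    data = rng.normal(loc=mu, scale=sigma, size=300)
```

      With small per-generation samples the estimate wanders away from the original distribution surprisingly fast.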

  • BackupRainDancer@lemmy.world · 16 points · edited · 2 years ago

    A predictable issue if you know the fundamental technology that goes into these models. Hell, it should have been obvious to the layperson that things were headed this way once they saw the videos and heard the audio.

    We’re less sensitive to patterns in massive data than the machines are; the point at which we can’t tell fact from AI fiction comes well before the point at which the machines can’t. Good luck to the FB aunts.

    A GAN’s entire goal is to generate content that is indistinguishable from the real thing… are we surprised?

    Edit, since the person below me made a great point: GANs may be limited, but nothing says you can’t set up a generator LLM and a detector LLM, trained against each other, for the sole purpose of improving the generator.
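    A minimal sketch of that generator-vs-detector loop, using tiny toy networks on 1-D data rather than LLMs (the architecture and hyperparameters here are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Generator maps noise to samples; detector scores samples as real vs. generated.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in for "human" data: samples from N(2, 0.5).
    return torch.randn(n, 1) * 0.5 + 2.0

for step in range(2000):
    # Train the detector to separate real from generated samples.
    real = real_batch()
    fake = G(torch.randn(64, 8)).detach()
    d_loss = (loss_fn(D(real), torch.ones(64, 1)) +
              loss_fn(D(fake), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator to fool the detector.
    fake = G(torch.randn(64, 8))
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift toward ~2.0
```

    The same loop structure applies when both boxes are language models: the detector exists only to make the generator harder to catch.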

  • RandomlyAssigned@lemmy.world · 15 points · 2 years ago

    On the one hand, our AI is designed to mimic human text; on the other hand, we claim we can detect AI-generated text that was designed to mimic human text. These two goals don’t align at a fundamental level.

  • Ogmios@lemmy.world · 14 points · 2 years ago

    Bluntly, even before AI there was an ever-present threat that anything you encountered online was written by someone with ulterior motives. Maybe AI just makes that easier to swallow, because people don’t want to distrust other people. The solution, as I see it, is to always be aware of what other purposes any particular piece of media could be serving, and to keep a clear picture in your own mind of what’s important to you, so that no matter who wrote something, or why, you won’t be personally misled.

    • volodymyr@lemmy.world · 2 points · 2 years ago

      I don’t think it’s possible to always assume you might be misled; the influences remain even when they aren’t noticed. It’s also not advisable to be too suspicious, since that breeds a conspiratorial mindset, which is the dark side of critical thinking. The information space is already loaded with trash, and AI is about to amplify it. I think we need personal identity management, and AI agents will have identities of their own too. The danger is that this is hard to do on a free internet. But it is possible in part; the technologies exist.

      • Ogmios@lemmy.world · 3 points · 2 years ago

        Frankly, with open access to the entire world, there are a very large number of completely real conspiracies which you are connected to, through intelligence agencies, mafias, and terrorist organizations. Failure to recognize this fact is a big problem with the way the Internet has been designed.

        • volodymyr@lemmy.world · 3 points · 2 years ago

          There are real conspiracies, but a conspiratorial mindset is still unhealthy. There is a joke: “even if you do have paranoia, it does not mean that THEY are not actually watching you.” It’s just important not to descend into paranoia, even if it starts from legitimate concerns. It’s also important to be aware that one person cannot derive all knowledge by themselves, so it is necessary to trust, even if conditionally. But right now there is no established technical process for choosing whom to trust. I just believe that most people in here are not bots and not crazy.

  • thebestaquaman@lemmy.world · 9 points · 2 years ago

    This just illustrates the major limitation of ML: access to reliable training data. A machine that has no concept of internal reasoning can never be truly trusted to solve novel problems, and novel problems, from minor issues to very complex ones, are solved in a bunch of professions every day. That’s what drives our world forward. If we rely too heavily on AI to solve problems for us, the issue of obtaining reliable training data for future AIs will only grow. That’s why I currently don’t think AIs will replace large swaths of the workforce; rather, they’ll be used as tools by the humans in it.

  • ChrislyBear@lemmy.world · 8 points · 2 years ago

    So every accusation of cheating/plagiarism etc., and the resulting bad grades, needs to be revised because the AI checker incorrectly labelled submissions as “created by AI”? OK.

    • Womble@lemmy.world · 1 point · 2 years ago

      I mean, judging by the dozens of AI-generated essays I’ve seen, it’s not hard to spot them from the many hallucinated references they all contain.
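      One way to automate that check is to look each citation up in a bibliographic database. A rough sketch using Crossref’s public search API (the matching heuristic and threshold are just illustrative):

```python
import requests

def reference_exists(citation: str) -> bool:
    """Look a bibliography line up on Crossref; a hallucinated reference
    usually has no close match. Crude heuristic, illustration only."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation, "rows": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    if not items:
        return False
    # Does the top hit's title share a reasonable share of the cited words?
    top_title = (items[0].get("title") or [""])[0].lower()
    cited_words = [w.strip(".,") for w in citation.lower().split() if len(w) > 4]
    matches = sum(w in top_title for w in cited_words)
    return matches >= max(1, len(cited_words) // 3)

print(reference_exists("Attention Is All You Need, Vaswani et al., 2017"))
```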

  • kvothelu@lemmy.world · 6 points · 2 years ago

    I wonder why Google still isn’t considering buying Reddit and other forums where personal discussion takes place and the user base sorts out quality content free of charge. It’s already well established that Google queries are way more useful when coupled with “reddit”.

  • Techmaster@lemmy.world · 5 points · 2 years ago

    Relax, everybody. I have figured out the solution. We pass a law that all AI generated text has to be in Pig Latin or Ubbi Dubbi.

  • professor_entropy@lemmy.world · 5 points · edited · 2 years ago

    FWIW, it’s not clear-cut whether AI-generated data feeding back into further training reduces accuracy or is generally harmful.

    Multiple papers have shown that images generated by high-quality diffusion models, mixed with a proportion of real images (30-50%), improve the adversarial robustness of the models. Similar things might apply to language modelling.
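    If you wanted to reproduce that kind of mix in a training pipeline, the data-loading step is roughly this simple (the 30-50% real fraction comes from the papers mentioned above; everything else here is made up):

```python
import random

def mix_training_data(real, synthetic, real_fraction=0.4, n_total=10_000, seed=0):
    """Build a training set that is `real_fraction` real samples and the rest
    model-generated. Sampling with replacement keeps the ratio exact even if
    one pool is small. Illustrative sketch only."""
    rng = random.Random(seed)
    n_real = int(n_total * real_fraction)
    batch = ([rng.choice(real) for _ in range(n_real)] +
             [rng.choice(synthetic) for _ in range(n_total - n_real)])
    rng.shuffle(batch)
    return batch

train = mix_training_data(real=["real_img_%d" % i for i in range(100)],
                          synthetic=["gen_img_%d" % i for i in range(100)])
print(len(train), sum(x.startswith("real") for x in train) / len(train))
```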

  • Matriks404@lemmy.world · 4 points · 2 years ago

    I wonder whether AI-generated text (or speech) will impact our language. Kind of an interesting thing to think about.

  • cerevant@lemmy.world · 3 points · 2 years ago

    If it could, it couldn’t claim that the content it produced was original. If AI-generated content were detectable, that would be a tacit admission that it is entirely plagiarized.

  • Toneswirly@lemmy.world · 3 points · 2 years ago

    OpenAI also benefits financially from keeping the hype train rolling: talking about how disruptive their own tech is gets them attention and investment. Just take it with a grain of salt.

    • diffuselight@lemmy.world · 4 points · 2 years ago

      It’s not possible to tell AI-generated text from human writing at any level of real-world accuracy. Just accept that.

        • diffuselight@lemmy.world · 2 points · edited · 2 years ago

          The entropy in text is not high enough to provide space for watermarking. No, it does not get better with longer text, because you have control over length/chunking. You have control over top-k, temperature, and the prompt, which creates an effectively infinite output space. Open text-generation-webui, go to the parameters page, and count the number of knobs you can adjust to guide the output. In the future you’ll be able to add WASM-encoded grammars to that list too.
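          To make the “knobs” point concrete, here is a toy next-token sampler showing just two of them, temperature and top-k; every setting reshapes which tokens can even appear (the vocabulary and logits are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Toy next-token sampler: temperature rescales the logits,
    top-k removes everything but the k most likely tokens."""
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.5, 0.3, -1.0, -2.0]   # pretend scores for 5 candidate tokens
for t, k in [(0.7, None), (1.0, 2), (1.5, None)]:
    picks = [sample_next_token(logits, temperature=t, top_k=k) for _ in range(1000)]
    print(f"temperature={t} top_k={k} -> token frequencies:",
          np.bincount(picks, minlength=5) / 1000)
```

          Multiply that by prompt wording, fine-tunes, and grammar constraints, and the space of possible outputs stops being something you can fingerprint.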

          Server-side hashing/watermarking can be trivially defeated via transformations/emoji injection. Latent-space positional watermarking breaks easily with post-processing. It would also kill any company trying to sell it (Apple be like… you want all your chats at OpenAI, or in the privacy of your phone?) and would ultimately be massively dystopian.

          Unlike with plagiarism checks, you can’t compare against a ground truth.

          Prompt guidance can box in the output space to the point where you could not possibly tell it’s not human. And the technology has moved from central servers to the edge: even if you could build detection for one LLM, another one outside your control, like a local LLaMA, is open source and won’t cooperate (see how quickly the Stable Diffusion 2 VAE watermarking was removed after release).

          In a year your iPhone will have a built-in LLM. Everything will have LLMs, some highly purpose-bound with only a few million parameters. Fine-tuning methods like LoRA are accessible to a large number of people with consumer GPUs today and will be commoditized within a year. Since fine-tuning can shape the output, it again increases the possibility space of outputs and will scramble any patterns.

          Finally, the bar is not “better than a flip of a coin”. If you are going to accuse people and ruin their academic careers, you need triple-nine accuracy, or you’ll wrongfully accuse hundreds of essays a semester.
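          The back-of-the-envelope arithmetic behind that bar (all the numbers here are made up for illustration):

```python
# Expected false accusations per semester from an imperfect detector,
# assuming every flagged essay is treated as cheating. Illustrative numbers.
essays_per_semester = 5000          # e.g. one university's worth of submissions
human_written_share = 0.9           # most essays are still written by humans

for accuracy in (0.95, 0.99, 0.999):
    false_positive_rate = 1 - accuracy
    falsely_accused = essays_per_semester * human_written_share * false_positive_rate
    print(f"{accuracy:.1%} specificity -> ~{falsely_accused:.0f} wrongly flagged essays")
```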

          The most likely detection would be if someone found a remarkably stable signature that magically works for all the models out there (hundreds by now), doesn’t break with updates (lol, see ChatGPT presumably getting worse), survives quantisation, and somehow can be kept secret from everyone, including AI that can trivially spot patterns in massive datasets. Not Going To Happen.

          Even if it were possible to detect, it would be model- or technology-specific and always lagging: we are moving at 2,000 miles an hour, and in a year it may not be transformers. There’ll be GAN or RNN elements fused in, or something completely new.

          The entire point of the technology is to approximate humanity. Plus, we’re approaching it from the other direction too: more and more conventional tools embed AI (from your camera no longer being able to take a picture untouched by AI, to Photoshop infill, to word autocomplete, to new spellchecking and grammar models).

          People latch onto the idea that you can detect it because it provides an escapist fantasy and copium, so they don’t have to face the change that is happening. If you could detect it, you could keep it out. You can’t. Not against anyone who has even the slightest idea of how to use this stuff.

          It’s like when gunpowder was invented and samurai threw themselves at the machine guns, because it rendered moot decades of training and perfection, of knowledge about fortification, war, and survival.

          For video, detection will remain viable for a long time due to the available entropy. For text, it’s always been snake oil, and everyone peddling it should be shot.