Reddit usernames like ‘SolidGoldMagikarp’ are somehow causing the chatbot to give bizarre responses.
Spoiler alert: it’s because of r/counting and some weirdness in the training data.
They really scraped all subs? No one had any concerns about that?
I believe, if this sort of generative AI is going to be trustworthy in the future, we need some sort of external verification system so we can make our own trust judgements based on the data used to train the system. For example, if a system is trained including 4chan as a data source, I’m going to trust it less than if it wasn’t trained using that source.
I don’t think big business yet realises how important the training data is but, as soon as they do, they will want the AI companies to provide guarantees about the sanity and appropriateness of the training data.
That’s how human intelligence works. We assign a value to the source of the information. The fact that the AIs seem to be trained without that explains why they “lie” so much. They simply reconstruct patterns without weighting any pattern by where it came from.
For example, take the claim “President Biden will launch a ground invasion of Russia.” If the New York Times, BBC, and CNN are all reporting it, we would give that claim a higher likelihood of being true than if it was only found on random blogs. However, if the blogs reporting it belonged to reputable reporters or bloggers on military and international affairs, we would still rate it as more likely to be correct than if it came from Bob’s Bigfoot and Alien Sightings Index.
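To make the weighting idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the source categories, the reliability numbers, and the choice of a noisy-OR style combination (which treats sources as independent, something real news outlets are not). It is not how any actual model is trained.

```python
from math import prod

# Hypothetical reliability scores in [0, 1]; these values are invented for illustration.
SOURCE_RELIABILITY = {
    "major_news_outlet": 0.9,
    "specialist_blog": 0.6,
    "random_blog": 0.1,
}


def claim_credibility(sources):
    """Combine per-source reliability with a noisy-OR rule.

    Assumes sources are independent: the claim is doubted only if
    every source reporting it turns out to be unreliable.
    """
    return 1 - prod(1 - SOURCE_RELIABILITY[s] for s in sources)


if __name__ == "__main__":
    for kind in SOURCE_RELIABILITY:
        score = claim_credibility([kind] * 3)
        print(f"three reports from {kind}: {score:.3f}")
```

Three reports from major outlets score higher than three from specialist blogs, which in turn score higher than three from random blogs. A model that only counts how often a pattern appears has no equivalent of the `SOURCE_RELIABILITY` table.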
Without the ability to check the accuracy of the source data, any generative AI could be corrupted. If you fed an art AI photos of the Statue of Liberty but kept telling it that it was the Eiffel Tower, when asked to draw the Eiffel Tower it would spit out the Statue of Liberty. Right now, without the ability to assess the accuracy of a response, any of the chat-based AIs are garbage for most of the use-cases companies are deploying them in.
Whether 4chan is a good data source or not depends on what you intend to use the AI for. If you want to have it interact with users on a web forum or similar context then using 4chan data would likely be very useful indeed.
Bear in mind that as long as it’s properly labelled then “bad” data is still useful as an example of bad data. A common example is with image AIs, where people can give negative prompts like “ugly” and “blurry” to tell the AI to make images that are not like that.
No. There’s a Computerphile video, but iirc r/counting was accidentally left in the training data set for part of the training process.
r/SubredditSimulator? What could go wrong?
for anyone who was hoping to have fun with this, i regret to inform you that this is from back in February, which may as well be the 1990s in AI development time. none of this works on the current ChatGPT, and the nature of the problem is more well-known to researchers.
I’m not surprised they used Reddit data to train. I am shocked a bit at how fucking lazy and haphazard they were with the data.
There are only logical arguments for anonymizing the data, which it looks like they didn’t do. It’s such a massive privacy risk not to. It also puts the company at legal risk. Who knows what other bizarre info it’ll leak.
Bro do you not understand what a public Internet forum is?
Yeah, right? Reddit isn’t private like Lemmy or kbin. I’d be shocked to know this comment would be out in the open, but here it’s absolutely safe.
Yeah we use TLS, everything’s encrypted!
What is there to leak? This was likely public information.
The silliness of anonymizing data that’s already wide open in the public aside, if you were to anonymize the usernames you’d end up producing a worse AI because often the literal username of the person in question is significant to the context of what’s being written. Think of all the “relevant username” comments, for example. People make puns about usernames, berate people for having offensive usernames, and so forth. If those usernames were all replaced with anonymized substitutes the AI would be training on nonsense.
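A rough Python sketch of the point about lost context. The `anonymize` helper, the `user_` token format, and the example username are all hypothetical; this is just one naive way a scrubbing pass might work, not a description of what any AI company actually does.

```python
import hashlib
import re


def anonymize(text, usernames):
    """Replace each known username with a stable, meaningless token.

    Hypothetical scrubbing pass: the same username always maps to the
    same token, but the token carries none of the name's meaning.
    """
    for name in usernames:
        token = "user_" + hashlib.sha256(name.encode()).hexdigest()[:8]
        text = re.sub(re.escape(name), token, text)
    return text


if __name__ == "__main__":
    comment = "Relevant username, PunsAreLife! Of course you'd say that."
    print(anonymize(comment, ["PunsAreLife"]))
```

After scrubbing, the comment still says “Relevant username” but now points at an opaque token, so the joke it is reacting to no longer exists in the text the model would see.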
… trained on quite raw data, which included like a load of weird Reddit stuff, …
Probably all we need to know, right there.
TheNitromeFan
I haven’t thought about Nitrome in a long time! They made some amazing Flash games.
Very odd.