A.I.’s un-learning problem: Researchers say it’s virtually impossible to make an A.I. model ‘forget’ the things it learns from private user data

assassin_aragorn@lemmy.world · 2 years ago

Primarily0617@kbin.social · edit-2 2 years ago

it’s crazy that “it’s too hard :(” has become an acceptable justification for just ignoring the law within tech circles

BrianTheeBiscuiteer@lemmy.world · 2 years ago

I’m not an AI expert, and I wouldn’t say it is too hard, but I believe removing a specific piece of data from a model is like trying to remove excess salt from a stew. You can add things to make the stew less salty but you can’t really remove the salt.

The alternative, which is a lot of effort but boo-hoo for big tech, is to throw out the model and start over without the data in question. These companies would do well to start with models built on public or royalty free data and then add more risky data on top of that (so you only have to rebake starting from the “public” version).

Primarily0617@kbin.social · 2 years ago

sounds like big tech shouldn’t have spent the last decade investing in a kitchen refit so that they could make stew really well but nothing else

GoosLife@lemmy.world · edit-2 2 years ago

If there’s something illegal in your dish, you throw it out. It’s not a question. I don’t care that you spent a lot of time and money on it. “I spent a lot of time preparing the circumstances leading to this crime” is not an excuse, neither is “if I have to face consequences for committing this crime, I might lose money”.

Grandwolf319@sh.itjust.works · 2 years ago

Replace salt with poison or an allergenic substance and it fully holds. If a batch has been contaminated, then yes, you should try again.

But now that the cat is out of the bag, other companies are less willing to let their data be scrapable, given how valuable it can be.

I think big tech knew this, that they can only build these models on unfiltered data before the AI craze.

Zeth0s@lemmy.world · edit-2 2 years ago

It’s actually a pretty normal thing in law. Laws are created with common sense in mind and compromises.

Currently EU laws do not cover generative AI. Now the EU needs to decide how to deal with it: whether to consider it a “lossy compressed database” and try to enforce a variation of GDPR with added fuzziness, or do something else.

garyyo@lemmy.world · 2 years ago

Always has been. The laws are there to incentivize good behavior, but when the cost of complying is larger than the projected cost of not complying they will ignore it and deal with the consequences. For us regular folk we generally can’t afford to not comply (except for all the low stakes laws that you break on a day to day basis), but when you have money to burn and a lot is at stake, the decision becomes more complicated.

The tech part of that is that we don’t really even know if removing data from these sorts of models is possible in the first place. The only way to remove it is to throw away the old one and make a new one (aka retraining the model) without the offending data. This is similar to how you can’t get a person to forget something without some really drastic measures. Even then, how do you know they forgot it? That information may still be used to inform their decisions; they might just not be aware of it, or they might feign ignorance. Only real way to be sure is to scrap the person. Given how insanely costly it can be to retrain a model, the laws start looking like “necessary operating costs” instead of absolute rules.

FaceDeer@kbin.social · 2 years ago

It’s more like the law is saying you must draw seven red lines, all of them strictly perpendicular, some with green ink and some with transparent ink.

It’s not “virtually” impossible, it’s literally impossible. If the law requires that it be possible then it’s the law that must change. Otherwise it’s simply a more complicated way of banning AI entirely, which means that some other jurisdiction will become the world leader in such things.

Ottomateeverything@lemmy.world · 2 years ago

It’s more like the law is saying you must draw seven red lines, all of them strictly perpendicular, some with green ink and some with transparent ink.

No, it’s more like the law is saying you have to draw seven red lines and you’re saying, “well I can’t do that with indigo, because indigo creates purple ink, therefore the law must change!” No, you just can’t use indigo. Find a different resource.

It’s not “virtually” impossible, it’s literally impossible. If the law requires that it be possible then it’s the law that must change.

There’s nothing that says AI has to exist in a form created from harvesting massive user data in a way that can’t be reversed or retracted. It’s not technically impossible to do that at all, we just haven’t done it because it’s inconvenient and more work.

The law sometimes makes things illegal because they should be illegal. It’s not like you run around saying we need to change murder laws because you can’t kill your annoying neighbor without going to prison.

Otherwise it’s simply a more complicated way of banning AI entirely

No it’s not, AI is way broader than this. There are tons of forms of AI besides forms that consume raw existing data. And there are ways you could harvest only data you could then “untrain”, it’s just more work.

Some things, like user privacy, are actually worth protecting.

Primarily0617@kbin.social · 2 years ago

ok i guess you don’t get to use private data in your models too bad so sad

why does the capitalistic urge to become “the world leader” in whatever technology-of-the-month is popular right now supersede a basic human right to privacy?

DigitalWebSlinger@lemmy.world · 2 years ago

“AI model unlearning” is the equivalent of saying “removing a specific feature from a compiled binary executable”. So, yeah, basically not feasible.

But the solution is painfully easy: you remove the data from your training set (ie, the source code), and re-train your model (recompile the executable).

Yes, it may cost you a lot of time and money to accomplish this, but such are the consequences of breaking the law. Maybe be extra careful about obeying laws going forward, eh?

londos@lemmy.world · 2 years ago

Far cheaper to just buy politicians and change the law.

Ann Archy@lemmy.world · 2 years ago

Just ask the AI to do it for you. Much better return on investment.

Ajen@sh.itjust.works · 2 years ago

removing a specific feature from a compiled binary executable

That’s actually very feasible. Compiled binaries translate directly to assembly, which is taught to most (all?) comp sci undergrads. When the binary is compiled by a standard compiler the translated assembly is very easy to understand, and for software that has protections/obfuscations like DRM and viruses there are reverse engineering tools like IDA Pro.

Dkarma@lemmy.world · edit-2 2 years ago

It takes so much money to retrain models tho… like the entire cost all over again… and what if they find something else?

Crazy how murky the legalities are here …just no caselaw to base anything on really

For people who don’t know how machine learning works, at a very high level:

Basically, every input the AI is trained on or “sees” changes a set of weights (floating-point numbers). Once the weights are changed, you can’t remove that input and change the weights back to what they were; you can only keep changing them on new input.
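To make that concrete, here is a minimal sketch in Python. It assumes a toy one-weight model trained with plain stochastic gradient descent; the data, the learning rate, and the function names are all made up for illustration, and real LLM training is vastly larger. It suggests why running one example's update "in reverse" on the finished weights does not give you the weights you would have had if the model had never seen that example.

```python
LEARNING_RATE = 0.1  # hypothetical value chosen for this toy example


def gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x


def train(data, w=0.0):
    """One pass of SGD over the data, in order, for the model y_hat = w * x."""
    for x, y in data:
        w = w - LEARNING_RATE * gradient(w, x, y)
    return w


data = [(1.0, 2.0), (2.0, 1.0), (1.5, 3.0)]
private = data[1]  # the example someone asks us to "forget"

w_all = train(data)
w_clean = train([d for d in data if d != private])

# Naive "undo": apply the private example's update backwards at the final weights.
x, y = private
w_undo = w_all + LEARNING_RATE * gradient(w_all, x, y)

print("trained on everything: ", w_all)
print("never saw private data:", w_clean)
print("naive undo attempt:    ", w_undo)
```

The undo attempt misses because every update after the private example was computed from weights that example had already shifted.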

DigitalWebSlinger@lemmy.world · 2 years ago

So we just let them break the law without penalty because it’s hard and costly to redo the work that already broke the law? Nah, they can put time and money towards safeguards to prevent themselves from breaking the law if they want to try to make money off of this stuff.

Dkarma@lemmy.world · 2 years ago

No one has established that they’ve broken the law in any way, though. Authors are upset but it’s unclear if they can prove they were damaged in some way or that the companies in question are even liable for anything.

Remember, the burden of proof is on the plaintiff, not these companies, if a suit is brought.

AWittyUsername@lemmy.world · 2 years ago

Much like DLLs exist for compiled binary executables, could we not have modular AI training data? Then only a small chunk would need to be relearned at a time.

Just throwing this into the void here.

Aceticon@lemmy.world · 2 years ago

The difference between having something in the training set of a neural network or not is going to be different values for non-integer factors all over the network and, worse, it is just as likely that they’re tiny differences as it is that they’re massive differences.

Or to give you a decent metaphor for it, “it would be like trying to remove a specific egg from a bowl of scrambled eggs”.

Dran@lemmy.world · 2 years ago

Or you know, if it’s impossible to strip out individual data, and it’s too expensive to retain/retrain models with data removed… Why is everyone overlooking “just don’t process private data, and only use public data in model training”?

Dojan@lemmy.world · 2 years ago

Yeah. Penalise it heavily so if you need to make a model, make manually vetting the data the most affordable option.

Ultimately, ensuring models are trained on safe, good, legal data, and not just random bullshit scraped off of the internet, will just be a net positive overall.

assassin_aragorn@lemmy.world · 2 years ago

Along those lines, perhaps you put in a stipulation that you don’t have to toss the model if you instead give the person a significant sum in royalties. After all, if their data isn’t a lynchpin in the model, you didn’t need it in the first place, and if it is crucial, you should pay them accordingly.

Punitive regulations seem to be the best way to make companies grow a sense of ethics.

CookieJarObserver@sh.itjust.works · 2 years ago

Just kill it off and start from the beginning.

Treczoks@lemmy.world · 2 years ago

Delete the AI and restart the training from the original sources minus the information it should not have learned in the first place.

And if they claim “this is more complicated than that” you know their process is f-ed up.

alternative_factor@kbin.social · 2 years ago

For the AI heads here: is this another problem caused by the “black box” style of LLM creation where they don’t really know how it actually works, so they don’t really know how to take out the data?

orclev@lemmy.world · 2 years ago

They know how it works. It’s a statistical model. Given a sequence of words, there’s a set of probabilities for what the next word will be. That’s the problem, an LLM doesn’t “know” anything. It’s not a collection of facts. It’s like a pachinko machine where each peg in the machine is a word. The prompt you give it determines where/how the ball gets dropped in and all the pins it hits on the way down corresponds to the output. How those pins get labeled is the learning process. Once that’s done there really isn’t any going back. You can’t unscramble that egg to pick out one piece of the training data.
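As a rough illustration of "given a sequence of words, there's a set of probabilities for what the next word will be", here is a toy sketch in Python. It assumes a bigram model that only looks at the single previous word; the corpus and function names are invented for the example. A real LLM is a neural network conditioned on much longer context, so treat this as an analogy rather than how any actual model is built.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    total = sum(counts[word].values())
    return {nxt: c / total for nxt, c in counts[word].items()}


corpus = ["the cat sat", "the cat ran", "the dog sat"]  # made-up training text
model = train_bigram(corpus)

probs_after_the = next_word_probs(model, "the")
print(probs_after_the)
```

One caveat: in this toy you could subtract a sentence's counts back out, because each count is a simple sum. The weights of a neural network trained by gradient descent are not simple sums over the training data, which is where the unscrambling problem comes from.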

garyyo@lemmy.world · 2 years ago

While you are overall correct, there is still a sort of “black box” effect going on. While we understand the mechanics of how the network architecture works the actual information encoded by training is, as you have said, not stored in a way that is easily accessible or editable by a human.

I am not sure if this is what OP meant by it, but it kinda fits and I wanted to add a bit of clarification. Relatedly, the easiest way to uncook (or unscramble) an egg is to feed it to a chicken, which amounts to basically retraining a model.

DharkStare@lemmy.world · 2 years ago

I really like that pachinko analogy. It gets the basic concept across without having to wade into technical descriptions.

darth_helmet@sh.itjust.works · 2 years ago

https://www.understandingai.org/p/large-language-models-explained-with I don’t think you’re intending to be purposefully misleading, but I would recommend checking this article out because the pachinko analogy is not accurate, really. There are several layers of considerations that the model makes when analyzing context to derive meaning. How well these models do with analogies is, I think, a compelling case for the model having, if not “knowledge” of something, at least a good enough analogue to knowledge to be useful.

Training a model on the way we use language is also training the model on how we think, or at least how we express our thoughts. There’s still a ton of gaps to work on before it’s an AGI, but LLMs are on to what’s looking more and more like the right path to getting there.

orclev@lemmy.world · 2 years ago

While it glosses over a lot of details it’s not fundamentally wrong in any fashion. An LLM does not in any meaningful fashion “know” anything. Training an LLM is training it on what words are used in relation to each other in different contexts. It’s like training someone to sing a song in a foreign language they don’t know. They can repeat the sounds and may even recognize when certain words often occur in proximity to each other, but that’s a far cry from actually understanding those words.

An LLM is in no way, shape or form anything even remotely like an AGI. I wouldn’t even classify an LLM as AI. LLMs are machine learning.

The entire point I was trying to make though is that an LLM does not store specific training data, rather what it stores is more like the hashed results of its training data. It’s a one way transform, there is absolutely no way to start at the finished model and drive it backwards to derive its training input. You could probably show from its output that it’s highly likely some specific piece of data was used to train it, but even that isn’t absolutely certain. Nor can you point at any given piece of the model and say what specific part of the training data it corresponds to or vice versa. Because of that it’s impossible to pluck out some specific piece of data from the model. The only way to remove data from the model is to throw the model away and train a new model from the original training data with the specific data removed from it.

Jerkface@lemmy.world · 2 years ago

Sort of. We know ‘how it works’ to the extent that it was engineered with a particular method and purpose. The problem is that it’s incredibly difficult to gain any insight into what’s ‘inside’ the network once the data has been propagated through it.

Visualizing a neural network can look a little bit like a constellation of stars. Each star is a node and is connected to other nodes. When given an input, each node makes a small calculation and passes the result to the other nodes they are connected to. The calculation is modified by the connection (by what is called a weight), and the results of the calculations change the weights of the connections. That’s what’s in the black box.

The constellations in an LLM are very large (the first L in LLM). Each ‘layer’ may have hundreds of nodes, each of which is connected to every node of the next layer. If there are 100 nodes in two adjacent layers, that makes 10,000 connections. There are many layers in an LLM.

Notice that I didn’t mention anything about the nodes or the connections storing any data. That’s because they don’t, at least in the sense that we’re used to thinking about it. There doesn’t exist a string of text that says ‘Bill Burr’s SSN is ###-##-####’. It’s just the nodes that do the calculations, and the weights of their connections.

So by now you can probably see why it’s so tricky to determine what’s ‘inside’ a neural network, because really it’s a set of operations instead of a set of data. The most reliable way to see what it does (so far) is to put something in and see what comes out.
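The connection arithmetic above (100 nodes fully connected to 100 nodes gives 10,000 connections) can be sketched in a few lines of Python. The helper name and the layer sizes are hypothetical, and it assumes every node in one layer connects to every node in the next, ignoring bias terms.

```python
def dense_connections(layer_sizes):
    """Count the weights between fully connected adjacent layers (biases ignored)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


two_layers = dense_connections([100, 100])             # 10000
four_layers = dense_connections([100, 100, 100, 100])  # 30000

print(two_layers, four_layers)
```

Every one of those connections is just a number, and none of them is labeled with the training example that nudged it.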

FaceDeer@kbin.social · 2 years ago

More that they know enough about how it works that they know it’s impossible to do. The data isn’t stored like files on a hard drive, in some discrete bundle of bytes somewhere, and the problem is simply trying to find and erase them. It’s stored as a distributed haze of weightings spread out over all of the nodes in the network, blended with all the other distributed hazes of everything else that the AI knows. A court may as well order a human to forget a specific fact, memories are stored in a similar manner.

Best the law can probably do right now is forbid AIs from speaking about certain facts. And even then, as we’ve seen with the likes of ChatGPT, there will be ways to talk around such bans.

garyyo@lemmy.world · 2 years ago

they know it’s impossible to do

There is some research into ML data deletion and it’s been shown to be possible, but maybe not on larger scales and maybe not something that is actually feasible compared to retraining.

eltimablo@kbin.social · 2 years ago

Think of it like this: you need a bunch of data points to determine the average of them all, but if you’re only given the average of a group of numbers, you can’t then go back and determine the original data points. It just doesn’t work like that.
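A quick sketch of the averaging point in Python, with made-up numbers: three different groups all collapse to the same average, so the average alone cannot tell you which group produced it.

```python
def average(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


group_a = [1, 2, 3]
group_b = [2, 2, 2]
group_c = [0, 0, 6]

averages = [average(g) for g in (group_a, group_b, group_c)]
print(averages)  # all three are 2.0
```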

reddig33@lemmy.world · 2 years ago

Sounds like bullshit.

Viking_Hippie@lemmy.world · 2 years ago

The Danish government, which has historically been very good about both privacy rights and workers’ rights, has recently suggested that they are looking into fixing the nurse shortage “via AI”.

Our current government is probably the stupidest, most irresponsible and least humanitarian one we’ve had in my 40 year lifetime if not longer 🤬

Nora@sh.itjust.works · 2 years ago

“virtually” impossible. hehehe

Pichu0102@kbin.social · 2 years ago

I feel like one way to do this would be to break up models and their training data into mini-models and mini-batches of training data instead of one big model, and also to restrict training data to material used with permission and public domain sources. In any case where a company is required to take down information in a model because its permission to use that information was revoked or has expired, it can identify the relevant training data in the mini-batches, remove it, then retrain the corresponding mini-model more quickly and efficiently than having to retrain the entire massive model.

A major problem with this though would be figuring out how to efficiently query multiple mini models and come up with a single response. I’m not sure how you could do that, at least very well…
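Here is one way the mini-model idea could be sketched in Python, under heavy assumptions: each "mini-model" is a trivial nearest-average classifier over one number, the shards are made up, and the single response comes from a simple majority vote. All names are hypothetical. Note that this is an ensemble of separate small models, not one neural network split into pieces, and whether voting like this would work at all for text generation is exactly the open question raised above.

```python
from collections import Counter


def train_mini(shard):
    """'Train' a mini-model: the average feature value seen for each label."""
    sums, counts = {}, Counter()
    for x, label in shard:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_mini(model, x):
    """Pick the label whose average is closest to x (assumes a non-empty model)."""
    return min(model, key=lambda label: abs(model[label] - x))


def predict(models, x):
    """Combine the mini-models into one answer by majority vote."""
    votes = Counter(predict_mini(m, x) for m in models)
    return votes.most_common(1)[0][0]


def forget(shards, models, record):
    """Drop one record and retrain only the shard that contained it."""
    for i, shard in enumerate(shards):
        if record in shard:
            shards[i] = [r for r in shard if r != record]
            models[i] = train_mini(shards[i])
            return i  # index of the single shard that was retrained
    return None


shards = [
    [(1.0, "a"), (9.0, "b")],
    [(2.0, "a"), (8.0, "b")],
    [(1.5, "a"), (7.5, "b"), (4.0, "b")],
]
models = [train_mini(s) for s in shards]
untouched = dict(models[0])

answer = predict(models, 3.0)
retrained = forget(shards, models, (4.0, "b"))
print(answer, retrained)
```

The trade-off is that each mini-model only ever learns from its own shard, so the combined answer may be worse than what one model trained on everything would give.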

Strawberry@lemmy.blahaj.zone · 2 years ago

You could certainly break up training data, but breaking up the models into mini models based on which training data is used wouldn’t work with neural networks trained using gradient descent. Basically, whatever the state of the model is, it depends on the totality of the training data that it has been trained on (and the order), and it isn’t possible to go and remove the effect of a specific training data point without then retraining on all of the data that followed that data point (and even that assumes you were storing a snapshot of the model before every single training data point, which I doubt anyone does).

However, that’s no excuse. It is of course possible to entirely retrain a network using a clean dataset, and that is what these companies should do.
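The snapshot-and-replay point can be sketched in Python with a toy one-weight model. The data, learning rate, and function names are assumptions made up for illustration, and storing a snapshot before every single example is, as noted, not something anyone is likely to do at scale. The sketch rolls back to the snapshot taken just before the unwanted example and replays everything that came after it, which in this toy matches training from scratch on the cleaned data.

```python
LEARNING_RATE = 0.1  # hypothetical value chosen for this toy example


def sgd_step(w, example):
    """One gradient step on squared error for the model y_hat = w * x."""
    x, y = example
    return w - LEARNING_RATE * 2 * (w * x - y) * x


def train_with_snapshots(data, w=0.0):
    """Return the final weight plus the weight as it was before each example."""
    snapshots = []
    for example in data:
        snapshots.append(w)
        w = sgd_step(w, example)
    return w, snapshots


def forget_by_replay(data, snapshots, index):
    """Roll back to just before data[index], then replay everything after it."""
    w = snapshots[index]
    for example in data[index + 1:]:
        w = sgd_step(w, example)
    return w


data = [(1.0, 2.0), (2.0, 1.0), (1.5, 3.0), (0.5, 1.0)]
w_final, snapshots = train_with_snapshots(data)

index = 1  # position of the example to remove
w_replayed = forget_by_replay(data, snapshots, index)
w_scratch, _ = train_with_snapshots(data[:index] + data[index + 1:])

print(w_final, w_replayed, w_scratch)
```

The cost grows with how early the removed example was: forgetting something from the start of training means replaying nearly the whole run.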

eltimablo@kbin.social · 2 years ago

I believe this is how the Tesla FSD beta AI works.

over_clox@lemmy.world · 2 years ago

Have you tried…

format Earth

AsunasPersonalAsst@lemmy.world · 2 years ago

Then why did they put it in in the first place, no? 👁👄👁

Fades@lemmy.world · 2 years ago

Everyone in the thread so triggered lol, do you hear yourselves?

eltimablo@kbin.social · 2 years ago

If anyone on Lemmy were capable of introspection, we wouldn’t have Lemmygrad or Beehaw.