I have a theory about AI and I'm wondering if I'm wrong?

dreadknight11@lemmy.world · 12 days ago

brucethemoose@lemmy.world · edit-2 12 days ago

…No.

The way “AI” is set up now is as big blocks of “weights” and can run in two modes:

Inference: Running a model to generate some kind of prediction from input, e.g. the next word in a block of text for an LLM. Typically, this is not so hard and is ‘batched,’ e.g. a single GPU may serve 16 people at once, in parallel (though I am skipping many intricacies here).
Training: Taking a bunch of data (like big blocks of specifically formatted, processed text for LLMs), and altering the weights to fit it with glorified linear regression. This is typically done on TONS of tightly networked GPUs in more specialized setups. Usually 8x big ones in one server, at a minimum. And all the data selection/formatting is done by humans, by hand, though sometimes enhanced with algorithms that, say, generate a thinking trace. There’s also a distinction between ‘pretraining’ (making the initial model) and ‘finetuning’ (slightly altering it with new data, which is tricky).

Point I’m making is: you cannot do both at once.

You can ‘infer’ AI, but internally, it will never change. It never learns.

You can train it with new data, but this is a huge and very manual/finicky and infrequent endeavor. And it takes a long time.
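
To make the split concrete, here is a toy sketch in plain Python. It is not how any real LLM stack is written: the "model" is a two-weight linear regression, and the names `predict` and `train_step` are made up for illustration. It only shows that inference reads the weights (for a whole batch at once) while training is a separate step that produces new ones.

```python
# Toy linear model standing in for "a big block of weights".
# predict() and train_step() are hypothetical names, not a real library API.

def predict(weights, bias, batch):
    """Inference: a read-only pass over the weights for a whole batch of inputs."""
    return [sum(w * x for w, x in zip(weights, xs)) + bias for xs in batch]

def mse(weights, bias, batch, targets):
    """Mean squared error of the model on (batch, targets)."""
    preds = predict(weights, bias, batch)
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(batch)

def train_step(weights, bias, batch, targets, lr=0.01):
    """Training: one gradient-descent step on MSE; returns NEW weights and bias."""
    n = len(batch)
    errors = [p - t for p, t in zip(predict(weights, bias, batch), targets)]
    grad_w = [
        2 / n * sum(e * xs[i] for e, xs in zip(errors, batch))
        for i in range(len(weights))
    ]
    grad_b = 2 / n * sum(errors)
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - lr * grad_b

weights, bias = [0.5, -0.25], 0.0
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # three "users" served in one call
targets = [1.0, 2.0, 3.0]

snapshot = list(weights)
outputs = predict(weights, bias, batch)        # [0.0, 0.5, 1.0]
assert weights == snapshot                     # inference left the weights untouched

new_weights, new_bias = train_step(weights, bias, batch, targets)
assert new_weights != weights                  # only the training step changed them
```

However many times you call `predict`, the weights stay exactly as they were. Anything the model "learns" has to come from a separate `train_step` run on data someone collected and prepared.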

Theoretically, ‘learning on the go’ is a goal of machine learning, but right now Big Tech is acting about as innovative as a brick, just scaling up architectures and stoking egos rather than paying attention to this kind of research. The Chinese LLM companies are also being relatively conservative, but in a different way.

Also, the recent fad (especially in China) is to make and use synthetic data instead, e.g. data some AI made up all by itself. This (in combination with smaller amounts of really clean/high quality ‘real’ data, e.g. not some random files stolen from your computer) is actually quite effective.

If you’re worried about privacy, the advertising ‘models’ companies like Facebook and Google already make, and have been making for over a decade, are closer to what you describe. It’s already happened, and we’ve been living with it for years.

Some of that is oldschool machine learning, but not all of it.

givesomefucks@lemmy.world · 12 days ago

So my theory is that with the help of telemetry or something else, AI can learn from data stored on users’ computers, meaning AI can steal your completed work, as well as your edits and corrections to your work, etc., even offline if you’re a Windows user for example.

No. It can’t do that.

slazer2au@lemmy.world · 12 days ago

All except this line is happening.

AI will literally steal your soul

And that can’t happen because it can’t steal what doesn’t exist.

zlatiah@lemmy.world · 12 days ago

I mean, that is pretty much what AI bros want to do… and/or may already be doing.

From a researcher/developer perspective: the biggest bottleneck that affects current-gen AI is the lack of high quality training data; the more high-quality (a.k.a. human-generated and not complete shitposts) training data, the better. What people write on their computers would probably be overwhelmingly high quality. That means, without major technological advancements… if AI companies have access to the kinds of content you just described, it is very much in their interest to use it.

I don’t 100% agree with this view, but if you subscribe to Prof. Emily M. Bender’s thought of seeing AI models as plagiarism machines, maybe you can say that AI is “stealing your soul”

brucethemoose@lemmy.world · edit-2 12 days ago

From a researcher/developer perspective: the biggest bottleneck that affects current-gen AI is the lack of high quality training data

I don’t completely agree with this. Recent papers have been working miracles with synthetic data generation and smaller datasets (e.g., Phi).

Meanwhile, there’s a lot of speculation that Llama4 failed because Meta’s ‘real’ data was vast but not ‘smart,’ with hints via lines like this:

In order to maximize performance, we had to prune 95% of the SFT data, as opposed to 50% for smaller models, to achieve the necessary focus on quality and efficiency.

Whereas DeepSeek, with a very similar architecture and size, wrote about how well synthetic data worked in their GRPO paper.

And this keeps happening. As an example, Kimi Linear is (subjectively) performing very well in spite of its ‘small’ training dataset: https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct

IMO the limiting factor seems to be GPU time, dev time, and willingness to ‘experiment’ with exotic architectures, optimizations, and more specialized models (including the burnt time/cash on experiments that don’t work).

MajorHavoc@programming.dev · 12 days ago

That is, fundamentally, what some of us figure the long term plan is with Microsoft Recall.

It came with various guarantees of privacy, the first time they tried it.

But they know no one reads changes to terms of service.

The sad part is that I fully expect that to be the default reality in a few years: a Microsoft model training on every keystroke and click on every copy of Windows 11/12.

SGGeorwell@lemmy.world · 12 days ago

There’s always pen and paper.