The basic idea behind the researchers’ data compression algorithm is that if an LLM knows what a user will be writing, it does not need to transmit any data; it can simply regenerate what the user wanted to transmit on the receiving end.
Great… but if that’s the case, maybe the user should reconsider the usefulness of transmitting that data in the first place.
Can’t wait to find hallucinated data in your uncompressed files.
I tried reading the paper. There is a free preprint version on arXiv. This page (from the article linked by OP) also links to the code they used and, at the end, the data they tried compressing.
While most of the theory is above my head, the basic intuition is that compression improves if you have some level of “understanding” or higher-level context of the data you are compressing. And LLMs are generally better at doing that than numeric algorithms.
As an example, if you recognize a sequence of letters as the first chapter of the book Moby-Dick, you’ll probably transmit that information more efficiently than a compression algorithm would. “The first chapter of Moby-Dick”; there… I just did it.
So if I have two machines running the same local LLM and I pass a prompt between them, I’ve achieved data compression by transmitting the prompt rather than the LLM’s expected response to the prompt? That’s what I’m understanding from the article.
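To make that concrete: here is a toy sketch of the scheme the comment above describes, not the paper’s actual method. The “model” is a trivial lookup table standing in for a local LLM run deterministically (e.g., temperature 0) on both machines; every name in it is a made-up placeholder. The sender transmits a short prompt, and the receiver regenerates the much longer expected response with the identical model.

```python
# Hypothetical stand-in for a deterministic local LLM shared by both
# machines: prompt -> generated text. Contents are illustrative only.
SHARED_MODEL = {
    "first chapter of Moby-Dick": (
        "Call me Ishmael. Some years ago, never mind how long precisely, "
        "having little or no money in my purse..."
    ),
}

def compress(text: str) -> str:
    """Sender: look for a prompt whose generation matches the text."""
    for prompt, generation in SHARED_MODEL.items():
        if generation == text:
            return prompt      # transmit the short prompt instead
    return text                # fallback: send the raw data, no gain

def decompress(message: str) -> str:
    """Receiver: regenerate with the same model; pass raw data through."""
    return SHARED_MODEL.get(message, message)

original = SHARED_MODEL["first chapter of Moby-Dick"]
wire = compress(original)
assert decompress(wire) == original   # lossless round trip
assert len(wire) < len(original)      # prompt is shorter than the data
```

Note the catch the later comments raise: this only works if both ends generate byte-identical output, and finding the prompt for arbitrary data is the hard part.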
Neat idea, but what if you want to transmit some information that an LLM can’t tokenize and generate accurately?
And how do I get the prompt that will reliably generate the data from the data? Usually for compression we do not start from an already compressed version.
I found the article to be rather confusing.
One thing to point out is that the video codec used in this research (but for which results weren’t published for some reason), H264, is not at all state of the art.
H265 is far newer and they are already working on H266. There are also other much higher quality codecs such as AV1. For what it’s worth, they do reference H265, but I don’t have access to the source research paper, so it’s difficult to say what they are comparing against.
The performance relative to FLAC is interesting though.
I wonder what the practical reasons for starting with H.264 are.
Interesting how they forgot to go over the architecture for LMDecompress.
Middle-LLM compression