LLMZip: Lossless Text Compression using Large Language Models

CoRR (2023)

Abstract
We provide new estimates of an asymptotic upper bound on the entropy of English using the large language model LLaMA-7B as a predictor for the next token given a window of past tokens. This estimate is significantly smaller than currently available estimates in \cite{cover1978convergent}, \cite{lutati2023focus}. A natural byproduct is an algorithm for lossless compression of English text which combines the prediction from the large language model with a lossless compression scheme. Preliminary results from limited experiments suggest that our scheme outperforms state-of-the-art text compression schemes such as BSC, ZPAQ, and paq8h.
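The general idea described in the abstract can be illustrated with a short sketch: a predictive model assigns a probability to each next token given its context, and the average of -log2 p(token | context) over a text is both an empirical upper bound on the source's entropy rate (in bits per token) and the code length an ideal arithmetic coder driven by that model would achieve. The sketch below is not the paper's implementation; it substitutes a toy character-level bigram model for LLaMA-7B so that it runs without external dependencies, and it reports the idealized code length rather than producing an actual compressed bitstream.

```python
# Minimal sketch of "prediction + entropy coding" for lossless text compression.
# The bigram model is a stand-in for a large language model such as LLaMA-7B.

import math
from collections import Counter, defaultdict


def train_bigram(text):
    """Count character bigrams to form a crude conditional model p(next | prev)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def prob(counts, prev, nxt, vocab_size=256, alpha=1.0):
    """Laplace-smoothed conditional probability of the next character."""
    c = counts.get(prev, Counter())
    return (c[nxt] + alpha) / (sum(c.values()) + alpha * vocab_size)


def bits_per_char(counts, text):
    """Average -log2 p(x_i | x_{i-1}): an upper bound on the entropy rate and
    the per-character code length an ideal arithmetic coder would attain."""
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        total += -math.log2(prob(counts, prev, nxt))
    return total / (len(text) - 1)


if __name__ == "__main__":
    corpus = "the quick brown fox jumps over the lazy dog " * 50
    model = train_bigram(corpus)
    print(f"estimated upper bound: {bits_per_char(model, corpus):.3f} bits/char")
```

Replacing the bigram predictor with an LLM's next-token distribution (and feeding those probabilities to a real arithmetic coder) yields a practical compressor of the kind the abstract describes; the better the model predicts English, the smaller the bound and the compressed size.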
Keywords
lossless text compression, large language models