LCLMs compress LLM context before decoding: 8.8x faster at 16x compression, beating every KV cache method tested. Open-sourced by NYU and Columbia.
Nvidia researchers have introduced a new technique that cuts the memory large language models need to track conversation history by as much as 20x, without modifying the model ...