Check out LongGen and S2-Attention, simple and effective architectures that substantially reduce the KV-cache overhead of long-context LLMs. We have also released an efficient, easy-to-use CUDA kernel library that supports various types of sparse attention.
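
To give a feel for why sparse attention shrinks the KV cache, here is a minimal, self-contained PyTorch sketch of one common pattern, sliding-window attention, where each decode step only keeps the most recent `window` keys and values. This is purely illustrative and is **not** the S2-Attention or LongGen API; the function name, shapes, and `window` parameter are assumptions for the example.

```python
# Illustrative sketch only (not the S2-Attention / LongGen API):
# sliding-window sparse attention, one simple way to bound the KV cache.
import torch
import torch.nn.functional as F

def sliding_window_decode_step(q, k_cache, v_cache, k_new, v_new, window=1024):
    """Append the new key/value, keep only the last `window` entries,
    and attend the single query token over the bounded cache."""
    k_cache = torch.cat([k_cache, k_new], dim=1)[:, -window:]           # (B, <=W, D)
    v_cache = torch.cat([v_cache, v_new], dim=1)[:, -window:]           # (B, <=W, D)
    scores = (q @ k_cache.transpose(1, 2)) / k_cache.shape[-1] ** 0.5   # (B, 1, <=W)
    out = F.softmax(scores, dim=-1) @ v_cache                           # (B, 1, D)
    return out, k_cache, v_cache

# Toy usage: the cache never grows beyond `window`, so KV memory stays
# constant no matter how long the generated sequence gets.
B, D, W = 1, 64, 8
k_cache = torch.zeros(B, 0, D)
v_cache = torch.zeros(B, 0, D)
for _ in range(32):
    q = torch.randn(B, 1, D)
    out, k_cache, v_cache = sliding_window_decode_step(
        q, k_cache, v_cache, torch.randn(B, 1, D), torch.randn(B, 1, D), window=W)
print(k_cache.shape)  # torch.Size([1, 8, 64]) -- bounded by the window size
```

In a full model, only some layers or heads would typically use such a sparse pattern while others keep full attention; an optimized CUDA kernel then avoids materializing the masked-out positions entirely rather than slicing dense tensors as this toy version does.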