Arxiv Papers

Key-value caching in large language models is crucial for decoding speed. Multi-Query Attention (MQA) and Cross-Layer Attention (CLA) reduce memory usage while maintaining accuracy, enabling larger models. https://arxiv.org/abs//2405.12981 YouTube: https://www.youtube.com/@ArxivPapers TikTok: https://www.tiktok.com/@arxiv_papers Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016 Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers --- Support this podcast: https://podcasters.spotify.com/pod/show/arxiv-papers/support

Duration:00:08:08

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Duration:00:18:08

[QA] Your Transformer is Secretly Linear

The paper uncovers a linear characteristic in transformer decoders, showing a near-perfect linear relationship between embedding transformations in sequential layers. Regularization reduces linearity and improves performance. https://arxiv.org/abs//2405.12250

Duration:00:08:28

Your Transformer is Secretly Linear

Duration:00:06:14

[QA] Training Data Attribution via Approximate Unrolled Differentiation

The paper introduces SOURCE, a computationally efficient training data attribution method that combines implicit differentiation and unrolling approaches, outperforming existing techniques in counterfactual prediction. https://arxiv.org/abs//2405.12186

Duration:00:08:24

Training Data Attribution via Approximate Unrolled Differentiation

Duration:00:27:49

[QA] Information Leakage from Embedding in Large Language Models

This study investigates privacy risks in large language models. It proposes methods to reconstruct user inputs from embeddings and introduces a defense mechanism to safeguard privacy in distributed learning systems. https://arxiv.org/abs//2405.11916

Duration:00:08:30

Information Leakage from Embedding in Large Language Models

Duration:00:12:01

[QA] Layer-Condensed KV Cache for Efficient Inference of Large Language Models

The proposed method reduces memory consumption in large language models by caching the KVs of only a small number of layers, improving throughput by up to 26× with competitive performance. https://arxiv.org/abs//2405.10637

Duration:00:08:12

Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Duration:00:09:38

[QA] Observational Scaling Laws and the Predictability of Language Model Performance

5/19/2024

The paper proposes an observational approach to language model scaling laws using 80 public models, showing that model performance and emergent phenomena can be predicted without extensive training across scales. https://arxiv.org/abs//2405.10938

Duration:00:09:45

Observational Scaling Laws and the Predictability of Language Model Performance

5/19/2024

Duration:00:25:03

[QA] Zero-Shot Tokenizer Transfer

This paper introduces Zero-Shot Tokenizer Transfer (ZeTT) to enable swapping tokenizers in language models, improving efficiency across languages and coding tasks without performance degradation. https://arxiv.org/abs//2405.07883

Duration:00:09:52

Zero-Shot Tokenizer Transfer

Duration:00:13:38

[QA] Many-Shot In-Context Learning in Multimodal Foundation Models

Multimodal foundation models show improved performance in many-shot in-context learning, with Gemini 1.5 Pro demonstrating higher efficiency and scalability compared to GPT-4o. https://arxiv.org/abs//2405.09798

Duration:00:11:38

Many-Shot In-Context Learning in Multimodal Foundation Models

Duration:00:10:06

[QA] Chameleon: Mixed-Modal Early-Fusion Foundation Models

https://arxiv.org/abs//2405.09818

Duration:00:08:59

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Duration:00:19:54

[QA] LoRA Learns Less and Forgets Less

LoRA is a parameter-efficient finetuning method for large language models, but underperforms full finetuning in most cases. It offers strong regularization and diverse generations. https://arxiv.org/abs//2405.09673

Duration:00:08:49

LoRA Learns Less and Forgets Less