Video — Deep Dive: Optimizing LLM inference
Open-source LLMs are great for conversational applications, but they can be difficult to scale in production, and their latency and throughput often fall short of your cost-performance objectives.
In this video, we zoom in on LLM inference optimization and study the key mechanisms that help reduce latency and increase throughput: the KV cache, continuous batching, and speculative decoding, including the state-of-the-art Medusa approach.
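To make the first of these mechanisms concrete, here is a minimal sketch of a KV cache. It is not taken from the video. It assumes a toy single-head attention layer in pure Python, and every name in it (`project`, `attend`, `KVCache`, `decode_with_cache`, `decode_without_cache`) is hypothetical and chosen for illustration. The idea it shows: during decoding, the keys and values of earlier tokens do not change, so they can be stored once and reused instead of being recomputed at every step.

```python
import math

def project(x, w):
    """Multiply vector x (length d) by matrix w (d x d)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

class KVCache:
    """Toy cache (hypothetical): stores one key and one value per token seen."""
    def __init__(self, wk, wv):
        self.wk, self.wv = wk, wv
        self.keys, self.values = [], []
        self.projections = 0  # counts key/value projections performed

    def append(self, x):
        # Only the newest token is projected; older entries are reused as-is.
        self.keys.append(project(x, self.wk))
        self.values.append(project(x, self.wv))
        self.projections += 2

def decode_with_cache(tokens, wq, wk, wv):
    cache, outputs = KVCache(wk, wv), []
    for x in tokens:
        cache.append(x)
        outputs.append(attend(project(x, wq), cache.keys, cache.values))
    return outputs, cache.projections

def decode_without_cache(tokens, wq, wk, wv):
    outputs, projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        # Keys and values for the whole prefix are recomputed at every step.
        keys = [project(x, wk) for x in prefix]
        values = [project(x, wv) for x in prefix]
        projections += 2 * len(prefix)
        outputs.append(attend(project(tokens[t], wq), keys, values))
    return outputs, projections

# Illustrative inputs only: identity weights and four 2-dimensional "tokens".
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

cached_out, cached_cost = decode_with_cache(tokens, identity, identity, identity)
naive_out, naive_cost = decode_without_cache(tokens, identity, identity, identity)
print("projections with cache:", cached_cost)
print("projections without cache:", naive_cost)
```

On these four tokens, both paths produce the same attention outputs, but the cached path performs 8 key/value projections against 20 without the cache. The gap grows quadratically with sequence length, which is why the cache trades memory for latency.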