Decoding LLM Speed: The Secret Metrics Behind Inference Performance
By Elijah Tobs
May 30, 2026 • 2:14 AM
The Core Insight
This guide demystifies the mechanics of LLM inference, breaking down the two-phase generation process, prefill and decode, and the essential metrics required to measure performance. It explains why LLMs are compute-bound during input processing and memory-bandwidth-bound during token generation, providing a foundation for optimizing real-world AI applications.
As the founder and primary investigative voice at Kodawire, Elijah Tobs brings over 15 years of experience in dissecting complex geopolitical and financial systems. His work is centered on the ethical governance of emerging technologies, the shifting architectures of global finance, and the future of pedagogy in a digital-first world. A staunch advocate for high-fidelity journalism, he established Kodawire to be a sanctuary for deep-dive intelligence. Moving away from the ephemeral nature of modern headlines, Kodawire delivers permanent, verified insights that challenge the status quo and empower the global reader.
The Hidden Cost of AI: Why Inference Optimization Matters
In the rush to deploy large language models, many teams focus almost exclusively on fine-tuning and model architecture. However, once you move from a research notebook to a production environment, the bottleneck shifts. It is no longer about how well your model learns; it is about how efficiently it serves. If you are building applications that rely on real-time responses, you are likely hitting the "memory wall" without even realizing it. For those moving beyond the notebook, understanding these constraints is the first step toward stability.
The Bottom Line
Measure the right things: Don't just look at average latency. Focus on p95/p99 tail latencies and "Goodput" to ensure a consistent user experience.
Understand the phases: Inference is split into a compute-bound Prefill phase and a memory-bandwidth-bound Decode phase.
Don't trust raw TPS: Because different models use different tokenizers, comparing "Tokens Per Second" across models can be misleading.
Optimize for the use case: Batch processing favors throughput, while interactive chatbots demand low TTFT.
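As a rough illustration of the tokenizer caveat, one option is to convert TPS into characters per second before comparing models. The sketch below assumes you have measured an average characters-per-token figure for each tokenizer on your own text; the helper name and every number in it are hypothetical, not measurements of any real model.

```python
def chars_per_second(tokens_per_second: float, avg_chars_per_token: float) -> float:
    """Convert a tokenizer-specific TPS figure into characters per second."""
    return tokens_per_second * avg_chars_per_token


# Hypothetical figures for two models with different tokenizers.
model_a = chars_per_second(tokens_per_second=100.0, avg_chars_per_token=3.0)
model_b = chars_per_second(tokens_per_second=80.0, avg_chars_per_token=4.0)

print(f"model A: {model_a:.0f} chars/s, model B: {model_b:.0f} chars/s")
```

With these made-up inputs, the model with the lower TPS delivers more text per second, which is the effect the warning above describes.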
I have spent years watching teams struggle with production deployments, and the most common mistake is treating inference as a black box. To truly optimize, you have to look under the hood at how these models actually process data. After digging into the mechanics of autoregressive generation, it becomes clear that performance isn't just about raw GPU power; it is about how you manage the flow of data through the hardware. If your application feels slower or less reliable in production than it did in testing, your inference strategy is a likely culprit.
Modern GPU hardware is the engine behind LLM inference, but software optimization determines how effectively that power is utilized. (Credit: Sergei Starostin via Pexels)
How I Researched This
My analysis is based on a deep dive into the mechanics of autoregressive inference. I have vetted the standard performance metrics, TTFT, TPOT, and E2E, against the realities of modern GPU utilization. I have cross-referenced the two-phase inference architecture (Prefill vs. Decode) to ensure the technical distinctions between compute-bound and memory-bound operations are accurate. This is a breakdown of the fundamental constraints that dictate whether your application feels responsive or sluggish.
Essential Metrics for Measuring LLM Performance
If you aren't measuring, you aren't optimizing. Most developers start with average latency, but that is a trap. A system that performs well on average but fails 5% of the time is a broken system in production. Implementing a robust MLOps observability stack is essential to catching these issues before they impact users.
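Time to First Token (TTFT): This is your "startup latency." It measures how long a user waits between sending a request and seeing the first token of the response.
Time per Output Token (TPOT): Once the engine is running, this measures the steady-state speed: the average time between consecutive output tokens after the first. It is the inverse of your generation speed, so a TPOT of 50ms corresponds to 20 tokens per second for that request.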
End-to-End Latency (E2E): The total time from the initial request to the final token.
Throughput (RPS/TPS): Requests per second (RPS) is useful for load testing, but Tokens per second (TPS) is the industry standard for LLM performance. Note: Be careful here. Because different tokenizers map tokens to characters differently, a higher TPS on one model doesn't always mean it is "faster" in terms of actual content delivery.
Latency Percentiles (p95, p99): These capture the "tail-end" experience. If your p99 is 2 seconds while your average is 200ms, one request in every hundred takes at least ten times longer than the average, and those users are having a bad time.
Goodput: This is the gold standard. It measures the percentage of requests that meet all your SLOs (e.g., TTFT < 500ms AND TPOT < 50ms) simultaneously.
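The metrics above can all be derived from per-token arrival timestamps. The following is a minimal sketch, not a reference implementation: the `RequestTiming` record, the nearest-rank percentile method, and the sample timings are all assumptions made for illustration, and the SLO thresholds are the example values from the Goodput definition above.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTiming:
    """Timestamps in seconds for one request (hypothetical record layout)."""
    sent_at: float
    token_times: List[float]  # arrival time of each output token, in order

    @property
    def ttft(self) -> float:
        # Time from sending the request to the first output token.
        return self.token_times[0] - self.sent_at

    @property
    def e2e(self) -> float:
        # Time from sending the request to the final output token.
        return self.token_times[-1] - self.sent_at

    @property
    def tpot(self) -> float:
        # Average gap between output tokens after the first one.
        n = len(self.token_times)
        if n < 2:
            return 0.0
        return (self.token_times[-1] - self.token_times[0]) / (n - 1)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def goodput(requests: List[RequestTiming], ttft_slo: float, tpot_slo: float) -> float:
    """Fraction of requests that meet both SLOs at the same time."""
    ok = sum(1 for r in requests if r.ttft < ttft_slo and r.tpot < tpot_slo)
    return ok / len(requests)


# Made-up timings: two healthy requests, one slow start, one slow generation.
requests = [
    RequestTiming(0.0, [0.20, 0.23, 0.26, 0.29]),
    RequestTiming(0.0, [0.90, 0.93, 0.96]),
    RequestTiming(0.0, [0.30, 0.40, 0.50]),
    RequestTiming(0.0, [0.10, 0.12, 0.14]),
]

ttfts = [r.ttft for r in requests]
print("p50 TTFT:", percentile(ttfts, 50))
print("p95 TTFT:", percentile(ttfts, 95))
print("goodput:", goodput(requests, ttft_slo=0.5, tpot_slo=0.05))
```

In this sample only half of the requests satisfy both thresholds, even though three of the four meet the TTFT target on its own, which is why Goodput is stricter than any single metric.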
The Other Side of the Story
Most industry experts obsess over "Tokens Per Second" as the ultimate benchmark. I disagree. Focusing on TPS is often a vanity metric that ignores the user's actual experience. A model that generates 100 tokens per second but has a 3-second TTFT will feel significantly slower to a human user than a model that generates 40 tokens per second with a near-instant TTFT. Stop optimizing for the machine's speed and start optimizing for the human's perception.
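To put rough numbers on that comparison, end-to-end latency can be approximated as TTFT plus one TPOT for each token after the first. This is a simplified sketch: it assumes a constant generation speed, and the 0.1-second figure is an assumed stand-in for "near-instant," not a value from any benchmark.

```python
def e2e_latency(ttft_s: float, tokens_per_second: float, output_tokens: int) -> float:
    """Approximate E2E = TTFT + (N - 1) * TPOT, with TPOT = 1 / tokens_per_second."""
    return ttft_s + (output_tokens - 1) / tokens_per_second


# The two systems from the paragraph above.
HIGH_TPS = {"ttft_s": 3.0, "tokens_per_second": 100.0}
FAST_START = {"ttft_s": 0.1, "tokens_per_second": 40.0}  # 0.1 s is an assumption

for n in (50, 200, 1000):
    a = e2e_latency(output_tokens=n, **HIGH_TPS)
    b = e2e_latency(output_tokens=n, **FAST_START)
    print(f"{n:>5} tokens: high-TPS {a:.3f}s, fast-start {b:.3f}s")
```

Under these assumptions, a 50-token reply completes in about 1.3 seconds on the fast-start system versus about 3.5 seconds on the high-TPS one. The high-TPS system only finishes first once replies pass roughly 195 tokens, and even then the user has watched a blank screen for the first 3 seconds.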
To understand why inference is so difficult to optimize, you have to look at the autoregressive nature of these models. They generate tokens one by one, and each new token depends on everything that came before it. This creates two distinct operational phases:
The Prefill Phase: Think of this as "reading the book." The model processes your entire input prompt at once. Because the input is known, the GPU can parallelize this into massive matrix-matrix operations. It is compute-bound, meaning the GPU's arithmetic units, rather than its memory bandwidth, are the limiting factor. During this phase, the model builds the KV cache, a memory structure that stores the attention keys and values for every processed token so they are not recalculated later. This is where your TTFT is largely determined.
The Decode Phase: This is "writing the book." The model generates one token at a time. Each step takes the newest token, appends its keys and values to the KV cache, and performs matrix-vector operations against the model weights. This is inefficient for hardware because it is memory-bandwidth-bound: every step reads the model weights and the KV cache from GPU memory to produce a single token, so you are moving massive amounts of data for a tiny calculation. This is where your TPOT is determined.
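A back-of-the-envelope estimate makes the contrast between the two phases concrete. The sketch below uses the common approximation of about 2 FLOPs per parameter per token for a forward pass, and treats each decode step as one full read of the weights. The model size and hardware figures are hypothetical round numbers, and the estimate ignores attention FLOPs, KV cache reads, and batching.

```python
def prefill_time_s(params: float, prompt_tokens: int, peak_flops: float) -> float:
    """Compute-bound estimate: roughly 2 FLOPs per parameter per prompt token."""
    return 2.0 * params * prompt_tokens / peak_flops


def decode_step_time_s(params: float, bytes_per_param: float, mem_bandwidth: float) -> float:
    """Bandwidth-bound estimate: every weight is read once per generated token."""
    return params * bytes_per_param / mem_bandwidth


# Hypothetical model and accelerator, chosen only for easy arithmetic.
PARAMS = 7e9           # 7B parameters
BYTES_PER_PARAM = 2.0  # 16-bit weights
PEAK_FLOPS = 300e12    # 300 TFLOP/s of usable compute (assumed)
MEM_BANDWIDTH = 2e12   # 2 TB/s of memory bandwidth (assumed)

prefill = prefill_time_s(PARAMS, prompt_tokens=1000, peak_flops=PEAK_FLOPS)
decode = decode_step_time_s(PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH)

# FLOPs available per byte moved, versus FLOPs a batch-1 decode step needs per byte.
hardware_ratio = PEAK_FLOPS / MEM_BANDWIDTH
decode_ratio = 2.0 / BYTES_PER_PARAM

print(f"prefill, 1000 tokens: {prefill * 1e3:.1f} ms")
print(f"decode, per token:    {decode * 1e3:.1f} ms")
print(f"FLOPs per byte: hardware {hardware_ratio:.0f}, decode {decode_ratio:.0f}")
```

With these numbers the whole 1,000-token prompt is processed in about 47 ms, while each generated token costs about 7 ms on its own. The decode step performs roughly 1 FLOP per byte read on hardware that could sustain 150, which is the gap the "memory-bandwidth-bound" label refers to.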
Efficient inference requires managing the memory bandwidth constraints of your server infrastructure. (Credit: Markus Winkler via Pexels)
The Hands-On Experience
When I test inference performance, I look for the "knee" in the latency curve. Using standard benchmarking tools, I monitor GPU utilization during the Prefill phase versus the Decode phase. If your GPU utilization drops off a cliff during generation, you are likely hitting a memory bandwidth bottleneck. I recommend testing with a variety of prompt lengths, as the Prefill phase scales differently than the Decode phase. For those looking to optimize further, consider knowledge distillation to reduce the model footprint.
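One way to run that kind of prompt-length sweep is to time any token stream with a small wrapper. The sketch below is an assumption about how such a harness could look, not the tooling described above: `fake_stream` is a stand-in that sleeps to imitate the two phases, and in practice you would pass in the streaming iterator from your own client.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def measure_stream(stream: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Consume a token stream and return TTFT, TPOT, E2E, and the token count."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in stream:
        now = clock()
        if first is None:
            first = now  # arrival of the first token
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    tpot = (last - first) / (count - 1) if count > 1 else 0.0
    return {"ttft": first - start, "tpot": tpot, "e2e": last - start, "tokens": count}


def fake_stream(prompt_tokens: int, output_tokens: int,
                prefill_s_per_token: float = 0.00005,
                decode_s_per_token: float = 0.002) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client; delays are invented."""
    time.sleep(prompt_tokens * prefill_s_per_token)  # imitate prefill
    for i in range(output_tokens):
        if i > 0:
            time.sleep(decode_s_per_token)  # imitate one decode step
        yield "tok"


for prompt_len in (100, 1000, 4000):
    result = measure_stream(fake_stream(prompt_len, output_tokens=10))
    print(f"prompt={prompt_len:>4}  ttft={result['ttft'] * 1e3:7.1f} ms  "
          f"tpot={result['tpot'] * 1e3:5.1f} ms")
```

The `clock` parameter is injectable so the timing logic can be checked with fixed timestamps instead of real sleeps.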
The Decision Matrix
Not sure where to focus your optimization efforts? Use this simple guide:
If you are building a Chatbot: Prioritize TTFT. Users will forgive a slow generation speed if the response starts immediately.
If you are doing Batch Processing: Prioritize Throughput (TPS). Latency matters less than the total time to process the entire dataset.
If you are building a Real-time Agent: Prioritize Goodput. You need consistent performance across both TTFT and TPOT to keep the agent responsive.
Future-Proofing Your Setup
The industry is moving toward techniques like speculative decoding and KV cache quantization to mitigate the memory-bandwidth bottleneck. If you are building for the long term, ensure your inference engine supports these features. Relying on raw, unoptimized inference will become increasingly expensive as models grow in size and context window requirements. Proper Kubernetes orchestration can help manage these scaling demands effectively.
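To see why KV cache quantization helps, it is useful to estimate how large the cache gets. The formula below counts one key and one value vector per layer, per KV head, per token. The model shape and batch size are hypothetical, and the estimate ignores the small overhead of the scale factors a real quantization scheme would store.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int) -> int:
    """Keys and values (the factor of 2) for every layer, head, and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 2 ** 30

# Hypothetical model shape and serving load.
shape = dict(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=8)

fp16 = kv_cache_bytes(bytes_per_elem=2, **shape)
int8 = kv_cache_bytes(bytes_per_elem=1, **shape)

print(f"16-bit KV cache: {fp16 / GIB:.0f} GiB")
print(f" 8-bit KV cache: {int8 / GIB:.0f} GiB")
```

For this assumed configuration the cache drops from 16 GiB to 8 GiB. Because the decode phase re-reads the cache on every step, halving its size reduces both the memory footprint and the bytes moved per token.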
Optimizing inference is a continuous process of monitoring, testing, and refining your deployment architecture. (Credit: Pramod Tiwari via Pexels)
Tools I Actually Use
vLLM: Currently the gold standard for high-throughput serving with PagedAttention.
TensorRT-LLM: Essential if you are locked into NVIDIA hardware and need maximum performance tuning.
Prometheus/Grafana: I use these to track p99 latencies in real-time. If you aren't visualizing your tail latencies, you are flying blind.
What Do You Think?
We have covered the technical reality of why inference is a two-phase struggle, but I want to hear from your experience in the field. When you are deploying models, do you find that your users complain more about the initial wait time (TTFT) or the speed of the text appearing on the screen (TPOT)? I will be replying to every comment in the next 24 hours.
The Prefill phase is compute-bound and involves processing the input prompt to build the KV cache. The Decode phase is memory-bandwidth-bound and involves generating tokens one by one.
TPS can be misleading because different tokenizers map tokens to characters differently, and a high TPS does not always correlate with a fast Time to First Token (TTFT), which is often more important for user experience.
Goodput is a performance metric that measures the percentage of requests that simultaneously meet all defined Service Level Objectives (SLOs), such as specific thresholds for both TTFT and TPOT.