Profiling is the practice of measuring where a program spends its time and resources (CPU, memory, I/O) to identify performance bottlenecks. Flame graphs are a visualization technique that makes profiling data intuitive by showing the call stack hierarchy and the relative cost of each function, enabling engineers to quickly pinpoint hot paths.
The most expensive performance mistake is optimizing the wrong thing. Engineers routinely spend days optimizing a function they believe is slow, only to discover that the real bottleneck is elsewhere -- a database query, a network call, or a lock contention issue. Profiling eliminates guesswork by measuring exactly where time and resources are consumed. The golden rule of performance engineering is: never optimize without profiling first.
There are several types of profiling. CPU profiling measures which functions consume the most processor time, using either sampling (periodically recording the call stack) or instrumentation (inserting measurement code at function entry/exit). Memory profiling tracks allocations and heap usage to find memory leaks and allocation-heavy code paths. I/O profiling measures time spent waiting for disk or network operations. Lock/contention profiling identifies where threads are blocked waiting for mutexes or other synchronization primitives. Each type reveals different bottlenecks, and a production performance issue might require multiple profiling approaches.
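As a small illustration of the instrumentation approach, the sketch below uses Python's standard-library `cProfile`, which records every function call rather than sampling. The workload functions `slow_sum` and `handle_request` are hypothetical stand-ins, not part of any real service.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately wasteful helper (hypothetical workload)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def handle_request():
    """Hypothetical request handler that calls the helper several times."""
    return sum(slow_sum(20_000) for _ in range(5))


def profile_report(func):
    """Run func under cProfile and return a text report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(10)
    return out.getvalue()
```

Calling `print(profile_report(handle_request))` lists each function with its call count and cumulative time. Because every call is recorded, the overhead is noticeably higher than a sampling profiler's, which is why this style suits development more than production.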
Flame graphs, invented by Brendan Gregg in 2011 (before he joined Netflix), changed how engineers read profiling data. A flame graph is a visualization where the x-axis represents the population of sampled stack traces (wider = more samples = more time spent) and the y-axis represents stack depth (bottom is the root, top is the leaf function). The x-axis does not show the passage of time, so the left-to-right order of frames carries no timing meaning. Each rectangle is a function in the call stack, and its width shows how often that function appeared in the sampled stacks, that is, how much time the function and everything it called consumed. To find bottlenecks, look for wide plateaus at the top of the graph -- these are leaf functions consuming significant CPU time.
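To make the width rule concrete, here is a minimal sketch that turns stack samples into rectangle widths. It assumes the "folded" text format used by the FlameGraph scripts (frames joined by `;`, then a space and a sample count); the sample data and the name `rectangle_widths` are made up for illustration.

```python
from collections import Counter

# Hypothetical folded stacks: "root;...;leaf <sample count>"
FOLDED = """\
main;serve;encode 3
main;serve;decode 1
main;idle 1
"""


def rectangle_widths(folded_text):
    """Return {stack prefix: fraction of all samples} for every rectangle."""
    counts = Counter()
    total = 0
    for line in folded_text.splitlines():
        if not line.strip():
            continue
        stack, _, samples = line.rpartition(" ")
        n = int(samples)
        total += n
        frames = tuple(stack.split(";"))
        # A sample counts toward every frame on its stack, so each
        # ancestor's rectangle is at least as wide as its children's.
        for depth in range(1, len(frames) + 1):
            counts[frames[:depth]] += n
    return {prefix: c / total for prefix, c in counts.items()}
```

With the data above, `rectangle_widths(FOLDED)[("main", "serve")]` is 0.8: four of the five samples passed through `serve`, so its rectangle spans 80% of the graph.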
Modern profiling tools make production profiling practical with low overhead. Linux perf captures CPU profiles by sampling, and its overhead is typically small at modest sampling rates, though the exact cost depends on sampling frequency and stack depth. Go's pprof provides built-in CPU and memory profiling. Java Flight Recorder (JFR) captures detailed profiles continuously in production. Async-profiler profiles Java without safepoint bias. Continuous profiling platforms (Google Cloud Profiler, Datadog Continuous Profiler, Pyroscope) aggregate profiles across fleet-wide deployments, enabling engineers to see system-wide hot paths rather than single-instance snapshots.
Reading a CPU Flame Graph
Imagine the flame graph of a web server. At the bottom, main() spans the full width (100% of CPU samples). Above it, handleRequest() spans 80% and backgroundTask() spans 20%. Within handleRequest(), parseJSON() spans 5%, queryDatabase() spans 15%, and renderTemplate() spans 60%. Within renderTemplate(), escapeHTML() spans 45%. The bottleneck is immediately visible: escapeHTML() inside renderTemplate() accounts for 45% of all CPU time. Optimizing this single function -- perhaps by caching escaped strings or using a faster escape implementation -- could cut CPU usage by up to 45%, the upper bound if its cost were eliminated entirely. Without the flame graph, you might have wasted time optimizing queryDatabase() (only 15%) or parseJSON() (only 5%).
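The arithmetic behind this reading can be sketched in a few lines. The percentages are the ones from the example; treating backgroundTask() and escapeHTML() as leaves is an assumption, since the example does not describe their children, and the helper names are hypothetical.

```python
# Inclusive width (% of samples) of each rectangle in the example.
INCLUSIVE = {
    "main": 100,
    "handleRequest": 80,
    "backgroundTask": 20,  # assumed to be a leaf
    "parseJSON": 5,
    "queryDatabase": 15,
    "renderTemplate": 60,
    "escapeHTML": 45,      # assumed to be a leaf
}
CHILDREN = {
    "main": ["handleRequest", "backgroundTask"],
    "handleRequest": ["parseJSON", "queryDatabase", "renderTemplate"],
    "renderTemplate": ["escapeHTML"],
}


def self_time(inclusive, children):
    """Self time = inclusive width minus the widths of direct children."""
    return {
        name: pct - sum(inclusive[c] for c in children.get(name, []))
        for name, pct in inclusive.items()
    }


def cpu_after_speedup(self_pct, speedup):
    """Remaining CPU (% of original) if one function gets `speedup` times faster."""
    return 100.0 - self_pct + self_pct / speedup
```

`self_time(INCLUSIVE, CHILDREN)` attributes 45% to escapeHTML() and only 15% to renderTemplate() itself, which is why the plateau at the top is the place to look. `cpu_after_speedup(45, 3)` gives 70.0, so a threefold faster escape routine would still leave 70% of the original CPU cost.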
Netflix
Brendan Gregg, the creator of flame graphs, applied them at Netflix to diagnose performance issues in the company's Java-based streaming infrastructure. A single flame graph revealed that 30% of CPU time was spent in a logging framework's string formatting code, which ran on every request even when the log level was disabled. A one-line fix reduced CPU usage fleet-wide by 30%.
Uber
Uber uses continuous profiling across their Go microservices fleet with pprof. They discovered that a commonly used serialization library was allocating excessive temporary objects, causing GC pauses that contributed to p99 latency spikes. Memory flame graphs pinpointed the allocation sites, and switching to a zero-allocation serializer reduced p99 latency by 40% across hundreds of services.
LinkedIn
LinkedIn deployed async-profiler across their Java services to capture CPU profiles without safepoint bias (a problem where standard JVM profilers can only sample threads at safepoints, so CPU time spent between safepoints is missed or attributed to the wrong code). They discovered that a regex-based input validation function, invisible to their previous profiler, was consuming 15% of CPU on their feed service. Replacing the regex with a hand-written parser eliminated the hotspot.
| Aspect | Description |
|---|---|
| Sampling vs Instrumentation Profiling | Sampling profilers (perf, pprof) have low overhead (1-5%) and are production-safe but may miss short-lived functions. Instrumentation profilers (gprof, manual timing) capture every call but add 10-50% overhead. For production use, sampling is almost always preferred; instrumentation is reserved for focused debugging in development. |
| CPU vs Off-CPU Profiling | CPU profiling reveals compute-bound bottlenecks but is blind to I/O waits, lock contention, and network delays. Off-CPU profiling captures blocking time but generates much more data and requires kernel-level tracing (eBPF, ftrace). Start with CPU profiling; switch to off-CPU when CPU utilization is low but latency is high. |
| Production vs Development Profiling | Production profiling captures real workloads but requires low-overhead tools and careful deployment. Development profiling allows heavier instrumentation but may miss production-only issues (different data sizes, concurrency patterns, cache behavior). The best practice is continuous low-overhead production profiling supplemented by focused development profiling. |
| Single-Instance vs Fleet-Wide Profiling | Profiling a single instance gives detailed per-request visibility but may not be representative. Fleet-wide continuous profiling (Google Cloud Profiler, Pyroscope) aggregates across all instances but requires infrastructure and storage for profile data. Fleet-wide profiling catches issues that only appear on specific instance types or traffic patterns. |
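The "low CPU utilization but high latency" signal from the CPU vs Off-CPU row can be checked roughly before reaching for kernel tracing. The sketch below compares wall-clock time with process CPU time using Python's `time` module; the helper name `cpu_fraction` and the two workloads are hypothetical.

```python
import time


def cpu_fraction(func):
    """Return CPU time / wall time for one call of func (roughly 0.0 to 1.0)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


def blocked():
    """Stand-in for an I/O or lock wait: time passes, no CPU is burned."""
    time.sleep(0.2)


def busy():
    """Stand-in for compute-bound work."""
    return sum(i * i for i in range(500_000))
```

A fraction near 1 suggests compute-bound work where a CPU flame graph will help. A fraction near 0, as with `cpu_fraction(blocked)`, means the time is spent waiting, and an off-CPU profile is the better tool.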
Netflix's 30% CPU Reduction via Flame Graph Analysis
Scenario
Netflix's streaming API services were running at higher CPU utilization than expected after a routine library upgrade. Traditional monitoring showed increased CPU usage but could not pinpoint the cause. The affected services had thousands of code paths, and code review of the library diff was inconclusive -- the changes seemed minor and unrelated to performance.
Solution
Brendan Gregg captured CPU flame graphs from production instances using Linux perf and his FlameGraph scripts. The flame graph immediately revealed a wide plateau in a logging library's string formatting function. The library upgrade had changed a log-level check from a fast integer comparison to a method call that constructed a formatted string before checking whether the log level was enabled. Even though the log message was never emitted (the level was disabled), the string formatting consumed 30% of CPU time on every request.
Outcome
A one-line fix -- adding an early return before string formatting when the log level was disabled -- reduced CPU usage by 30% across the entire streaming fleet. This saved Netflix millions of dollars in EC2 costs annually. The incident became a canonical example of why profiling is essential: the bug was invisible to code review, unit tests, and traditional monitoring. Only flame graph visualization made the bottleneck immediately obvious.
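The same class of bug is easy to reproduce in miniature. The sketch below is a Python analogue using the standard `logging` module, not the Netflix code or the library involved; `CountingPayload` and the three logging helpers are hypothetical names that exist only to count how often the expensive formatting runs.

```python
import logging

logger = logging.getLogger("flame_demo")
logger.setLevel(logging.INFO)  # DEBUG messages are disabled


class CountingPayload:
    """Hypothetical object whose string rendering is expensive."""

    def __init__(self):
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return "large payload"


def log_eager(payload):
    # The f-string is built before debug() can check the level.
    logger.debug(f"request: {payload}")


def log_lazy(payload):
    # logging defers %-formatting until a record is actually emitted.
    logger.debug("request: %s", payload)


def log_guarded(payload):
    # Explicit early check, the shape of the fix described above.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("request: " + str(payload))
```

With DEBUG disabled, `log_eager` still renders the payload on every call, while `log_lazy` and `log_guarded` never do. Nothing is written to the log in any of the three cases, which is why this kind of cost tends to show up in a profile rather than in a code review.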