Profiling & Flame Graphs

Profiling is the practice of measuring where a program spends its time and resources (CPU, memory, I/O) to identify performance bottlenecks. Flame graphs are a visualization technique that makes profiling data intuitive by showing the call stack hierarchy and the relative cost of each function, enabling engineers to quickly pinpoint hot paths.

Overview

The most expensive performance mistake is optimizing the wrong thing. Engineers routinely spend days optimizing a function they believe is slow, only to discover that the real bottleneck is elsewhere -- a database query, a network call, or a lock contention issue. Profiling eliminates guesswork by measuring exactly where time and resources are consumed. The golden rule of performance engineering is: never optimize without profiling first.

There are several types of profiling. CPU profiling measures which functions consume the most processor time, using either sampling (periodically recording the call stack) or instrumentation (inserting measurement code at function entry/exit). Memory profiling tracks allocations and heap usage to find memory leaks and allocation-heavy code paths. I/O profiling measures time spent waiting for disk or network operations. Lock/contention profiling identifies where threads are blocked waiting for mutexes or other synchronization primitives. Each type reveals different bottlenecks, and a production performance issue might require multiple profiling approaches.
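
To make the sampling approach concrete, here is a minimal sketch of a sampling profiler in Python. It is an illustration under stated assumptions, not how perf or pprof are implemented: a background thread periodically reads the target thread's stack through the CPython-specific `sys._current_frames()`, and the helper names (`sample_stacks`, `busy_work`, `profile`) are hypothetical. Because it samples on wall-clock intervals, it records a stack whether the thread is running or blocked.

```python
import collections
import sys
import threading
import time


def sample_stacks(target_thread_id, interval_s, stop_event, counts):
    """Periodically record the target thread's call stack, root first."""
    while not stop_event.is_set():
        frame = sys._current_frames().get(target_thread_id)
        names = []
        while frame is not None:
            names.append(frame.f_code.co_name)
            frame = frame.f_back
        if names:
            # Store as "root;child;leaf", the folded-stack convention.
            counts[";".join(reversed(names))] += 1
        time.sleep(interval_s)


def busy_work(duration_s):
    """Hypothetical hot function: spin until the deadline passes."""
    deadline = time.perf_counter() + duration_s
    total = 0
    while time.perf_counter() < deadline:
        total += 1
    return total


def profile(func, *args, interval_s=0.005):
    """Run func(*args) while a sampler thread records the caller's stack."""
    counts = collections.Counter()
    stop = threading.Event()
    sampler = threading.Thread(
        target=sample_stacks,
        args=(threading.get_ident(), interval_s, stop, counts),
    )
    sampler.start()
    try:
        func(*args)
    finally:
        stop.set()
        sampler.join()
    return counts


folded = profile(busy_work, 0.3)
for stack, n in folded.most_common(3):
    print(n, stack)
```

The profiled code is untouched; the cost is bounded by the sampling interval rather than by the number of function calls, which is why sampling tends to be the cheaper approach.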

Flame graphs, invented by Brendan Gregg in 2011, changed how engineers read profiling data. A flame graph is a visualization where the x-axis represents the population of stack traces (wider = more samples = more time spent), and the y-axis represents stack depth (bottom is the root, top is the leaf function). The x-axis does not show the passage of time: frames are typically sorted alphabetically so that identical stacks merge into one rectangle. Each rectangle is a function in the call stack. The width of a rectangle shows how much time that function (and its children) consumed. To find bottlenecks, look for wide plateaus at the top of the graph -- these are leaf functions consuming significant CPU time.
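
The relationship between samples and rectangle widths can be sketched in a few lines of Python. This is a simplified illustration, not the FlameGraph scripts themselves; the sample data and the function names `fold` and `frame_widths` are made up for the example. Each sample is a stack listed root first, identical stacks are merged into counts, and a frame's width is the fraction of all samples whose stack passes through it.

```python
from collections import Counter

# Hypothetical raw samples: one tuple per sample, root first.
samples = [
    ("main", "handle", "parse"),
    ("main", "handle", "render"),
    ("main", "handle", "render"),
    ("main", "idle"),
]


def fold(samples):
    """Merge identical stacks into 'root;child;leaf' -> sample count."""
    return Counter(";".join(stack) for stack in samples)


def frame_widths(folded):
    """Width of each frame (keyed by its path from the root) as a fraction of all samples."""
    total = sum(folded.values())
    counts = Counter()
    for stack, count in folded.items():
        frames = stack.split(";")
        # A sample contributes to every frame on its path, not only the leaf.
        for depth in range(1, len(frames) + 1):
            counts[";".join(frames[:depth])] += count
    return {path: n / total for path, n in counts.items()}


widths = frame_widths(fold(samples))
for path, width in sorted(widths.items()):
    print(f"{width:5.0%}  {path}")
```

Here `main` spans the full width, `main;handle` spans 75%, and `main;handle;render` spans 50%, matching the rule that a parent is at least as wide as the sum of its children.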

Modern profiling tools make production profiling practical with minimal overhead. Linux perf captures CPU profiles with low overhead at modest sampling rates; the exact cost depends on sampling frequency and stack depth. Go's pprof provides built-in CPU and memory profiling. Java Flight Recorder (JFR) captures detailed profiles continuously in production. Async-profiler profiles Java without safepoint bias. Continuous profiling platforms (Google Cloud Profiler, Datadog Continuous Profiler, Pyroscope) aggregate profiles across fleet-wide deployments, enabling engineers to see system-wide hot paths rather than single-instance snapshots.

Key Points
  • CPU flame graphs show where CPU time is spent. Wide plateaus at the top indicate hot functions. Narrow towers indicate deep but fast call chains. The width of each frame is proportional to the number of samples in which that function appeared on the stack.
  • Off-CPU flame graphs show where threads are blocked -- waiting for I/O, locks, network responses, or sleep. These complement CPU flame graphs: if CPU utilization is low but latency is high, the bottleneck is off-CPU, and only off-CPU profiling will reveal it.
  • Sampling profilers (perf, pprof, async-profiler) record the call stack periodically (e.g., 99 times per second). They have low overhead (1-5%) and are safe for production. Instrumentation profilers (gprof, JProfiler in instrumentation mode) add code at every function boundary, providing exact counts but with 10-50% overhead.
  • Always profile in production or with production-realistic workloads. Development workloads differ from production in data volume, concurrency patterns, cache hit rates, and GC pressure. A profile from a local dev environment may be completely misleading.
  • Differential flame graphs compare two profiles (before/after a deployment, or baseline vs slow period) by coloring each frame according to how its sample count changed between them. Red indicates functions that got slower (more samples); blue indicates functions that got faster (fewer samples). This is invaluable for regression analysis.
  • Memory flame graphs show allocation sites rather than CPU time. Each frame's width represents bytes allocated (not time spent). These reveal allocation-heavy code paths that create GC pressure, even if the allocations themselves are fast.
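
As a rough illustration of attributing memory to allocation sites, the sketch below uses Python's standard `tracemalloc` module. The function `build_payloads` is a hypothetical allocation-heavy code path. One caveat: `tracemalloc` reports memory still allocated at snapshot time, which is related to, but not the same as, the total bytes allocated that a memory flame graph usually shows.

```python
import tracemalloc


def build_payloads(n):
    """Hypothetical allocation-heavy function: n separate 1 KiB buffers."""
    return [bytearray(1024) for _ in range(n)]


tracemalloc.start(10)  # keep up to 10 frames per allocation traceback
payloads = build_payloads(1000)
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Group live allocations by the full call stack that made them, largest first.
top = snapshot.statistics("traceback")[0]
print(f"{top.size / 1024:.0f} KiB in {top.count} blocks, allocated via:")
for line in top.traceback.format():
    print(line)
```

Grouping by traceback rather than by line is what makes the data suitable for a flame graph: each traceback is a stack, and its size plays the role that sample count plays in a CPU profile.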
Simple Example

Reading a CPU Flame Graph

Imagine a web server with a flame graph showing: at the bottom is main() spanning the full width (100% of CPU). Above it, handleRequest() spans 80% and backgroundTask() spans 20%. Within handleRequest(), parseJSON() spans 5%, queryDatabase() spans 15%, and renderTemplate() spans 60%. Within renderTemplate(), escapeHTML() spans 45%. The bottleneck is immediately visible: escapeHTML() inside renderTemplate() consumes 45% of all CPU time. Optimizing this single function -- perhaps by caching escaped strings or using a faster escape implementation -- could cut CPU usage by up to 45%, the share it would free if its cost were eliminated entirely. Without the flame graph, you might have wasted time optimizing queryDatabase() (only 15%) or parseJSON() (only 5%).
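
The same reading can be reproduced numerically. The sketch below encodes the example's percentages as folded stacks and computes, per function, inclusive cost (the rectangle's width) and self cost (the part of the rectangle with nothing stacked on top). Two assumptions are made here that the example does not state: the 15% of renderTemplate() not covered by escapeHTML() is its own work, and the other leaf functions have no children. The 3x speedup factor is likewise hypothetical.

```python
from collections import defaultdict

# Folded stacks reconstructed from the percentages in the example (sum = 100).
folded = {
    "main;handleRequest;parseJSON": 5,
    "main;handleRequest;queryDatabase": 15,
    "main;handleRequest;renderTemplate": 15,  # assumed self cost: 60 - 45
    "main;handleRequest;renderTemplate;escapeHTML": 45,
    "main;backgroundTask": 20,
}


def inclusive_and_self(folded):
    """Return (inclusive, self) cost per function name."""
    inclusive = defaultdict(int)
    self_cost = defaultdict(int)
    for stack, count in folded.items():
        frames = stack.split(";")
        for name in set(frames):  # set() avoids double counting recursive frames
            inclusive[name] += count
        self_cost[frames[-1]] += count  # only the leaf was actually on-CPU
    return dict(inclusive), dict(self_cost)


def cost_after_speedup(total, hot_cost, factor):
    """Total cost if the hot function alone becomes `factor` times faster."""
    return total - hot_cost + hot_cost / factor


inclusive, self_cost = inclusive_and_self(folded)
hottest = max(self_cost, key=self_cost.get)
remaining = cost_after_speedup(100, self_cost[hottest], 3)
print(hottest, self_cost[hottest], remaining)
```

Sorting by self cost surfaces escapeHTML (45), while main and handleRequest, despite being the widest rectangles, have no self cost at all. Making escapeHTML three times faster would leave 70% of the original CPU cost.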

Real-World Examples

Netflix

Brendan Gregg, who created flame graphs in 2011 before joining Netflix, applied them there to diagnose performance issues in the company's Java-based streaming infrastructure. A single flame graph revealed that 30% of CPU time was spent in a logging framework's string formatting code, which ran on every request even when the log level was disabled. Fixing this one-line issue reduced CPU usage fleet-wide by 30%.

Uber

Uber uses continuous profiling across their Go microservices fleet with pprof. They discovered that a commonly used serialization library was allocating excessive temporary objects, causing GC pauses that contributed to p99 latency spikes. Memory flame graphs pinpointed the allocation sites, and switching to a zero-allocation serializer reduced p99 latency by 40% across hundreds of services.

LinkedIn

LinkedIn deployed async-profiler across their Java services to capture CPU profiles without safepoint bias (a problem where standard JVM profilers can only sample threads at safepoints, so they miss or misattribute CPU time spent between them). They discovered that a regex-based input validation function, invisible to their previous profiler, was consuming 15% of CPU on their feed service. Replacing the regex with a hand-written parser eliminated the hotspot.

Trade-Offs
Aspect | Description
Sampling vs Instrumentation Profiling | Sampling profilers (perf, pprof) have low overhead (1-5%) and are production-safe but may miss short-lived functions. Instrumentation profilers (gprof, manual timing) capture every call but add 10-50% overhead. For production use, sampling is almost always preferred; instrumentation is reserved for focused debugging in development.
CPU vs Off-CPU Profiling | CPU profiling reveals compute-bound bottlenecks but is blind to I/O waits, lock contention, and network delays. Off-CPU profiling captures blocking time but generates much more data and requires kernel-level tracing (eBPF, ftrace). Start with CPU profiling; switch to off-CPU when CPU utilization is low but latency is high.
Production vs Development Profiling | Production profiling captures real workloads but requires low-overhead tools and careful deployment. Development profiling allows heavier instrumentation but may miss production-only issues (different data sizes, concurrency patterns, cache behavior). The best practice is continuous low-overhead production profiling supplemented by focused development profiling.
Single-Instance vs Fleet-Wide Profiling | Profiling a single instance gives detailed per-request visibility but may not be representative. Fleet-wide continuous profiling (Google Cloud Profiler, Pyroscope) aggregates across all instances but requires infrastructure and storage for profile data. Fleet-wide profiling catches issues that only appear on specific instance types or traffic patterns.
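
A quick way to decide between CPU and off-CPU profiling is to compare wall-clock time with CPU time for the same operation. The sketch below does this with Python's standard clocks; `cpu_fraction`, `compute`, and `wait` are hypothetical names, and `time.process_time()` counts CPU time for the whole process, so the ratio is only a rough signal in multi-threaded programs.

```python
import time


def cpu_fraction(func, *args):
    """Return (wall_seconds, cpu_seconds / wall_seconds) for one call."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, (cpu / wall if wall > 0 else 0.0)


def compute(n):
    """CPU-bound stand-in: time is spent executing instructions."""
    return sum(i * i for i in range(n))


def wait(seconds):
    """Off-CPU stand-in: time is spent blocked, as with I/O or a lock."""
    time.sleep(seconds)


_, busy_ratio = cpu_fraction(compute, 2_000_000)
_, idle_ratio = cpu_fraction(wait, 0.2)
print(f"compute: {busy_ratio:.0%} on-CPU, wait: {idle_ratio:.0%} on-CPU")
```

A ratio near 1 suggests a CPU flame graph will be informative; a ratio near 0 means most of the latency is blocking time that only off-CPU profiling or tracing can explain.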
Case Study

Netflix's 30% CPU Reduction via Flame Graph Analysis

Scenario

Netflix's streaming API services were running at higher CPU utilization than expected after a routine library upgrade. Traditional monitoring showed increased CPU usage but could not pinpoint the cause. The affected services had thousands of code paths, and code review of the library diff was inconclusive -- the changes seemed minor and unrelated to performance.

Solution

Brendan Gregg captured CPU flame graphs from production instances using Linux perf and his FlameGraph scripts. The flame graph immediately revealed a wide plateau in a logging library's string formatting function. The library upgrade had changed a log-level check from a fast integer comparison to a method call that constructed a formatted string before checking whether the log level was enabled. Even though the log message was never emitted (the level was disabled), the string formatting consumed 30% of CPU time on every request.

Outcome

A one-line fix -- adding an early return before string formatting when the log level was disabled -- reduced CPU usage by 30% across the entire streaming fleet. This saved Netflix millions of dollars in EC2 costs annually. The incident became a canonical example of why profiling is essential: the bug was invisible to code review, unit tests, and traditional monitoring. Only flame graph visualization made the bottleneck immediately obvious.
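
The class of bug described here is easy to reproduce in any language. The following Python sketch is an analogy using the standard `logging` module, not the library or code from the case study; the `Expensive` class and logger name are hypothetical. It counts how often a costly string conversion runs when the DEBUG level is disabled.

```python
import logging


class Expensive:
    """Hypothetical object whose string conversion is costly; counts conversions."""

    def __init__(self):
        self.format_calls = 0

    def __str__(self):
        self.format_calls += 1
        return "x" * 1000


logger = logging.getLogger("flamegraph_demo")
logger.setLevel(logging.INFO)  # DEBUG messages are disabled

payload = Expensive()

# Eager: the f-string is built before debug() gets a chance to check the level.
logger.debug(f"payload={payload}")
eager_calls = payload.format_calls

# Lazy: logging formats the message only if the record will be emitted.
logger.debug("payload=%s", payload)
lazy_calls = payload.format_calls - eager_calls

# Explicit guard: the equivalent of an early return before formatting.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("payload=%s", str(payload))

print(eager_calls, lazy_calls, payload.format_calls)
```

The eager call pays for one conversion that is immediately thrown away, while the lazy call and the guarded call pay for none. On a hot request path, that discarded work is exactly the kind of wide plateau a CPU flame graph exposes.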

Common Mistakes
  • Optimizing without profiling first. Developers often guess which function is slow based on intuition, but such guesses are frequently wrong. Profile first, then optimize the actual hotspot -- not the suspected one.
  • Profiling only in development environments with toy data sets. Production workloads have different data volumes, cache hit rates, and concurrency patterns. A function that is fast with 100 rows may be the bottleneck with 10 million rows.
  • Ignoring off-CPU time. If your CPU utilization is only 30% but latency is high, the bottleneck is I/O, lock contention, or network waits. CPU flame graphs will show nothing useful -- you need off-CPU flame graphs or tracing to find the real issue.
  • Taking a single profile snapshot and treating it as representative. Performance varies with time of day, traffic patterns, and cache state. Profile during multiple time windows, especially during peak traffic and after cache cold starts.
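
Once profiles from several time windows are available, comparing them is a small computation over folded stacks. The sketch below is a simplified take on the idea behind differential flame graphs, with made-up data and a hypothetical `diff_folded` helper. It normalizes each profile to shares of its own total, which is one possible design choice, so that a window with more samples overall does not make every stack look slower.

```python
def diff_folded(before, after):
    """Per-stack change in share of samples (after minus before), largest growth first."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    deltas = {}
    for stack in set(before) | set(after):
        share_before = before.get(stack, 0) / total_before
        share_after = after.get(stack, 0) / total_after
        deltas[stack] = share_after - share_before
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical folded profiles: stack -> sample count.
baseline = {"main;handle;render": 50, "main;handle;query": 30, "main;idle": 20}
slow_period = {"main;handle;render": 50, "main;handle;query": 120, "main;idle": 30}

changes = diff_folded(baseline, slow_period)
for stack, delta in changes:
    color = "red" if delta > 0 else "blue"  # grew vs shrank
    print(f"{delta:+.0%}  {color:4}  {stack}")
```

In this made-up data the query path grows from 30% to 60% of samples, so it sorts first and would be drawn red, while the render path keeps the same sample count but shrinks as a share.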
Test Your Understanding

1. In a CPU flame graph, what does the width of a rectangle represent?

2. When should you use off-CPU flame graphs instead of CPU flame graphs?
