Understand operational systems in production
Summary
Over the past 7 weeks I have been working through case studies of software systems whose performance degraded because they were not tuned for the OS they run on. The goal was to better understand operating systems, the potential bottlenecks in production, and how to *correctly* diagnose such issues in order to fix them.
This publication walks through 3 typical scenarios and the steps to diagnose the performance issue in each. Along the way it uncovers OS architecture details that play a crucial role in the performance of software systems.
Disclaimer: these case studies were prepared and reviewed with ChatGPT. They rest on semi-academic experience and should be verified in real practice.
By "operational system," I don't mean Linux or Kubernetes.
It's a way of looking at a system at work:
how it behaves under load, where latency occurs,
and what mechanisms actually govern this behavior.
Scenario 1 — "There's a CPU, but it feels like it's not there"
Context
- Kubernetes cluster
- container with service (single-threaded, CPU-bound)
- node: 16 CPUs, 64 GB RAM
- stable load
In this text, "system" doesn't just refer to code.
It also refers to:
the application,
the operating system,
and the resources through which they interact (memory, disk, CPU).
Symptoms
- latency fluctuates (sometimes +50–70%)
- container CPU usage: ~60–70%
- inside the container: top → the process doesn't reach 100%
- node CPU isn't fully utilized (~40–50%)
- sufficient memory
- almost no I/O
The intuitive mistake here is to look at individual metrics:
CPU, memory, disk.
But the system isn't "using the CPU." It's slowing down as a whole.
Questions
- What hypotheses will you come up with? (at least 3)
- What will you check first?
- What commands will you run?
- Where might the "hidden limit" be?
Discovery
What hypotheses will you come up with? (at least 3)
- A CPU quota set on the container's cgroup
- The host takes CPU time away from the container, for itself or for another container
- The CPU switches frequently between multiple processes, which evicts caches and other associated state
- Part of the memory is allocated on a different NUMA node
What will you check first?
It makes sense to start by checking the CPU limit configured on the container. Then check steal time and CPU context switches. Finally, check the NUMA hypothesis.
What commands will you run?
To check throttling:
$ cat /sys/fs/cgroup/cpu.max
$ cat /sys/fs/cgroup/cpu.stat
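Watch how nr_throttled and throttled_usec (throttled_time on cgroup v1) change over time
To check Kubernetes limits:
$ kubectl describe pod <pod-name>
Look at the CPU limits and requests of the container
To check CPU steal:
$ top
Look at the st value in the %Cpu(s) line; context switches are covered by pidstat below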
To check scheduler:
$ pidstat -w 1
$ runqlat (bcc)
To check NUMA:
$ numactl --show
$ numastat
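The context-switch hypothesis can also be checked without extra tools, because Linux exposes per-process counters in /proc/<pid>/status. Below is a minimal Python sketch under that assumption; the helper name parse_ctxt_switches is made up for illustration. A non-voluntary count that grows quickly between two reads suggests the process is being preempted rather than blocking on its own.

```python
def parse_ctxt_switches(status_text: str) -> dict:
    """Extract context-switch counters from the text of /proc/<pid>/status."""
    counters = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"):
            counters[key] = int(value.strip())
    return counters


if __name__ == "__main__":
    # Reads the current process; replace "self" with a PID to inspect a service.
    try:
        with open("/proc/self/status") as f:
            print(parse_ctxt_switches(f.read()))
    except OSError:
        print("/proc is not available on this platform")
```

Reading the counters twice, a few seconds apart, and comparing the deltas gives roughly the same signal as pidstat -w.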
Where might the "hidden limit" be?
cgroup (cpu.max / cpu.cfs_quota_us)
Kubernetes: resources.limits.cpu
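To make the "hidden limit" concrete, here is a small Python sketch that interprets the two cgroup v2 files mentioned above. It assumes the cgroup v2 formats: cpu.max holds "<quota> <period>" (or "max <period>" when unlimited) and cpu.stat holds one "key value" pair per line. The function names are hypothetical. With the default 100 ms period, a Kubernetes limit of 500m becomes "50000 100000", which is half a core. A single-threaded process capped below one core would be consistent with a process that never reaches 100% in top.

```python
def parse_cpu_max(text: str):
    """Return the CPU limit in cores from cgroup v2 cpu.max, or None if unlimited."""
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_cpu_stat(text: str) -> dict:
    """Parse cgroup v2 cpu.stat ("key value" per line) into a dict of ints."""
    stat = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stat[key] = int(value)
    return stat


def throttled_ratio(stat: dict) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stat.get("nr_periods", 0)
    return stat.get("nr_throttled", 0) / periods if periods else 0.0


if __name__ == "__main__":
    try:
        with open("/sys/fs/cgroup/cpu.max") as f:
            print("limit in cores:", parse_cpu_max(f.read()))
        with open("/sys/fs/cgroup/cpu.stat") as f:
            print("throttled ratio:", throttled_ratio(parse_cpu_stat(f.read())))
    except OSError:
        print("cgroup v2 CPU files not found at /sys/fs/cgroup")
```

A throttled ratio that keeps rising while node CPU sits at 40–50% points at the quota rather than at a lack of CPU.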
Scenario 2 — "There's memory, but the system is slow"
Context
- VM in the cloud
- 8 CPUs, 32 GB RAM
- Database (active write)
- SSD
Symptoms
- CPU ~30–40%
- Swap is almost unused
- There is free memory (~5–8 GB)
- but:
- latency is increasing
- transactions are slowing down
- sometimes there are stops of hundreds of milliseconds
Metrics:
- vmstat: wa → low
- vmstat: si/so → almost 0
- but: cat /proc/pressure/memory: some avg10=0.20
Questions
- Why is the system slow even with "free memory"?
- What does memory pressure mean?
- What kernel mechanisms can block a process?
- What should you check next?
Discovery
Why is the system slow even with "free memory"?
Free memory (~5–8 GB) does not mean allocations are cheap. Much of what tools report as free or available can be held in reclaimable objects such as page cache, buffers, or slab, so the kernel may have to run memory reclaim or compaction before it can satisfy an allocation, and the allocating process waits while it does. In addition, frequent flushing of dirty pages to disk and the page faults that follow can cause stalls of hundreds of milliseconds, which slows transactions even when CPU usage is moderate and swap is barely used.
What does memory pressure mean?
Memory pressure is the share of time that processes spend stalled waiting for memory. It does not require the system to be out of free memory; it arises when the kernel has to reclaim pages (from cache, buffers, or slab) or wait for writeback of dirty pages before it can hand memory out. In Linux, this is visible in `/proc/pressure/memory`. The avg fields are already percentages, so `some avg10=0.20` means that at least one task was stalled on memory for about 0.2% of the last 10 seconds; a 20% stall would read `avg10=20.00`. Even a small non-zero value is worth noting, because an average can hide short, sharp stalls.
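A short Python sketch of how a PSI file can be read, assuming the documented line format "some avg10=... avg60=... avg300=... total=...". The function names are hypothetical, and every sample value except avg10=0.20 is invented for illustration. Because the avg fields are percentages, the sketch also converts one into stalled seconds within its window.

```python
def parse_psi(text: str) -> dict:
    """Parse a /proc/pressure/* file into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        result[kind] = {}
        for field in fields:
            key, _, value = field.partition("=")
            result[kind][key] = float(value)
    return result


def stalled_seconds(avg_percent: float, window_seconds: float) -> float:
    """Convert a PSI average (a percentage) into seconds stalled in the window."""
    return avg_percent / 100.0 * window_seconds


if __name__ == "__main__":
    # Illustrative sample; only avg10=0.20 comes from the scenario.
    sample = (
        "some avg10=0.20 avg60=0.10 avg300=0.05 total=123456\n"
        "full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
    )
    psi = parse_psi(sample)
    print(psi["some"]["avg10"], "% of the last 10 s")
    print(stalled_seconds(psi["some"]["avg10"], 10), "seconds stalled")
```

The "some" line counts time in which at least one task was stalled; the "full" line counts time in which all non-idle tasks were stalled at once.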
What kernel mechanisms can block a process?
Key mechanisms include:
- Direct reclaim: when background reclaim (kswapd) cannot keep up, the allocating process has to free pages itself and blocks while doing so.
- Page cache writeback: processes may wait while dirty pages are flushed to disk.
- Locks or I/O wait: processes can block on the device queue or on filesystem locks, especially with synchronous writes.
- Pressure Stall Information (PSI): not a blocking mechanism itself, but the kernel's measure of how long processes wait due to memory, CPU, or I/O pressure.
What should you check next?
To investigate further:
- Disk queue depth and latency using `iostat -x 1` to see if the SSD is a bottleneck.
- Memory stalls and reclaim activity via `/proc/pressure/memory` and `vmstat 1`.
- Amount of dirty pages with `cat /proc/meminfo | grep Dirty` to assess writeback impact.
- Database-specific synchronous writes (fsync, commit) that could amplify latency.
- Slab and kernel memory usage to identify potential kernel-side bottlenecks.
Scenario 3 — "The database is killing the disk... but the disk is fast"
Context
- PostgreSQL
- NVMe SSD
- 16 CPUs, 64 GB RAM
- High write load
Symptoms
- CPU ~20–30%
- iowait jumps to 15–20%
- Query latency is unstable
Metrics:
iostat -x:
util: ~60%
await: jumps (1ms → 20ms)
/proc/meminfo:
Dirty: high
Writeback: increasing
Questions
- Why can NVMe have such latencies?
- What's happening with dirty pages?
- How are page cache and fsync related?
- Where is the bottleneck: disk or memory?
Discovery
Why can NVMe have such latencies?
NVMe itself is not inherently slow: typical device latency is in the tens to hundreds of microseconds. The observed spikes (1 ms → 20 ms) therefore point to queuing and system-level effects rather than raw device latency. In this scenario, the likely causes are:
- Writeback pressure: the kernel cannot flush dirty pages fast enough
- Queue buildup: bursts of writes create variable queue depth
- fsync stalls: processes wait for durability guarantees
- Write amplification: many small synchronous writes (e.g., WAL)
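The queuing effect can be estimated with Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The Python sketch below applies it to the await values from the metrics; the 10,000 IOPS figure is a made-up assumption used only to show the arithmetic, not a measurement from this scenario.

```python
def avg_requests_in_flight(iops: float, await_ms: float) -> float:
    """Little's law: requests in flight = arrival rate * time in system."""
    return iops * (await_ms / 1000.0)


if __name__ == "__main__":
    assumed_iops = 10_000  # hypothetical write rate, for illustration only
    for await_ms in (1, 20):
        depth = avg_requests_in_flight(assumed_iops, await_ms)
        print(f"await={await_ms} ms -> about {depth:.0f} requests in flight")
```

At a constant request rate, a 20x jump in await means roughly 20x more requests waiting, which is why latency can spike while util stays well below 100%.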
What's happening with dirty pages?
Dirty pages are memory pages that have been modified in the page cache but not yet written to disk. In this case:
- High write workload → rapid accumulation of dirty pages
- Writeback starts, but cannot keep up → the Writeback counter grows
- Dirty pages approach kernel limits (vm.dirty_ratio / vm.dirty_bytes)
As a result:
- Application writes are slowed down
- Processes are forced to wait for writeback
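To see how close the system is to those limits, the Dirty and Writeback counters can be compared with the thresholds implied by the ratio sysctls. This is a rough Python sketch with loud assumptions: the kernel applies the ratios to its own notion of dirtyable memory, which is approximated here by MemAvailable; the ratios 10 and 20 are commonly seen defaults for vm.dirty_background_ratio and vm.dirty_ratio and should be read from sysctl on the real machine; and if vm.dirty_bytes is set, the ratio is not used at all. The function names and the sample numbers are invented for illustration.

```python
def parse_meminfo_kb(text: str) -> dict:
    """Parse /proc/meminfo lines such as 'Dirty:   1234 kB' into {name: kB}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values


def dirty_thresholds_kb(available_kb: int, background_ratio: int, dirty_ratio: int):
    """Approximate (background, hard) dirty limits as a share of available memory."""
    background = available_kb * background_ratio // 100
    hard = available_kb * dirty_ratio // 100
    return background, hard


if __name__ == "__main__":
    # Invented sample values; on a real host read /proc/meminfo instead.
    sample = "MemAvailable:  8000000 kB\nDirty:  1200000 kB\nWriteback:  300000 kB\n"
    info = parse_meminfo_kb(sample)
    background, hard = dirty_thresholds_kb(info["MemAvailable"], 10, 20)
    print(f"Dirty={info['Dirty']} kB, background={background} kB, hard={hard} kB")
```

Once Dirty exceeds the background threshold the kernel starts flushing in the background; as it nears the hard threshold, writing processes are throttled.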
How are page cache and fsync related?
PostgreSQL writes data to the page cache first, not directly to disk. The flow is:
- Write → data enters page cache (becomes dirty)
- fsync() → forces the kernel to flush relevant dirty pages to disk
- Process waits until data is physically persisted
When writeback is already falling behind:
- fsync latency increases
- Query latency becomes unstable
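The write-then-fsync flow can be observed directly with a small probe. The Python sketch below is an illustration of the mechanism, not PostgreSQL's own code path: it does a buffered write followed by fsync on a temporary file and times both steps. The function name, file name, and payload size are arbitrary choices.

```python
import os
import tempfile
import time


def timed_write_and_fsync(path: str, payload: bytes):
    """Write payload to path, then fsync; return (write_seconds, fsync_seconds)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        t0 = time.perf_counter()
        os.write(fd, payload)  # buffered write: data lands in the page cache
        t1 = time.perf_counter()
        os.fsync(fd)           # blocks until the kernel reports the data as persisted
        t2 = time.perf_counter()
    finally:
        os.close(fd)
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        probe = os.path.join(directory, "probe.bin")
        write_s, fsync_s = timed_write_and_fsync(probe, b"x" * 8192)
        print(f"write: {write_s * 1e6:.0f} us, fsync: {fsync_s * 1e6:.0f} us")
```

On a busy system the write step usually stays fast while the fsync step absorbs the waiting, which matches the pattern of unstable query latency.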
Where is the bottleneck: disk or memory?
The bottleneck is neither purely the disk nor purely memory, but their interaction. More precisely:
- The disk is not saturated (util ~60%), so raw throughput is not the limit
- Memory (page cache) is accumulating dirty data faster than it can be flushed
- The kernel writeback mechanism becomes the limiting factor
This shows up as:
- Write throttling
- Latency spikes
- Increased iowait despite moderate disk utilization
The modern education and place of AI here
Broad vs Narrow specialisation