Understand operational systems in production

Summary

Over the past 7 weeks I have been working through case studies of software systems whose performance degraded because they were not tuned for the OS they run on. The goal was to better understand operational systems, the potential bottlenecks in production, and how to *correctly* diagnose such issues in order to fix them.

This publication walks through 3 typical scenarios and the steps to diagnose the performance issues in each. Along the way it uncovers OS architecture details that play a crucial role in the performance of software systems.

Disclaimer: these case studies were prepared and reviewed by ChatGPT. They reflect purely semi-academic experience, which should be verified in real practice.

By "operational system," I don't mean Linux or Kubernetes.

It's a way of looking at a system at work:
how it behaves under load, where latency occurs, and what mechanisms actually govern this behavior.

In this text, "system" doesn't just refer to code.

It also refers to:
the application,
the operating system,
and the resources through which they interact (memory, disk, CPU).

Scenario 1 — "There's a CPU, but it feels like it's not there"

Context

Kubernetes cluster
container with service (single-threaded, CPU-bound)
node: 16 CPUs, 64 GB RAM
stable load

Symptoms

latency fluctuates (sometimes +50–70%)
container CPU usage: ~60–70%
inside the container:

                        top → the process doesn't reach 100%

at the same time:
- node CPU isn't fully utilized (~40–50%)
- sufficient memory
- almost no I/O

The intuitive mistake here is to look at individual metrics:
CPU, memory, disk.

But the system isn't "using the CPU." It's slowing down as a whole.

Questions

What hypotheses will you come up with? (at least 3)
What will you check first?
What commands will you run?
Where might the "hidden limit" be?

Discovery

What hypotheses will you come up with? (at least 3)

A CPU quota is set on the container's cgroup
The host takes CPU time away from the container, for itself or for another container
The CPU switches frequently between processes, evicting caches and other associated state
Part of the memory is allocated on a different NUMA node

What will you check first?

It makes sense to start with the CPU limit configured on the container, then check steal time and CPU context switches, and finally test the NUMA hypothesis.

What commands will you run?

To check throttling:

                    $ cat /sys/fs/cgroup/cpu.max
                    $ cat /sys/fs/cgroup/cpu.stat

Look at how nr_throttled and throttled_usec change over time (on cgroup v1 the counter is named throttled_time).
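To turn the raw counters into a single number, the share of scheduler periods in which the container was throttled can be computed from cpu.stat. The following is a minimal sketch, assuming the cgroup v2 field names nr_periods and nr_throttled; the function name throttle_ratio is made up for this example.

```shell
# Print the percentage of CFS periods in which the cgroup was throttled.
# Reads cpu.stat-formatted text (cgroup v2) from stdin.
throttle_ratio() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0)
                printf "%.1f\n", 100 * throttled / periods
            else
                print "n/a"   # no CPU limit set, or no periods elapsed yet
        }
    '
}

# Usage on a live system (inside the container):
#   throttle_ratio < /sys/fs/cgroup/cpu.stat
```

The counters are cumulative since the cgroup was created, so running the function twice a few seconds apart and comparing the results says more about the current state than a single reading.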

To check Kubernetes limits:

                    $ kubectl describe pod

Look for cpu limits & cpu requests

To check CPU steal (the st value in the CPU summary line):

                    $ top
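The steal value that top displays is derived from the counters in /proc/stat, so it can also be computed directly from two samples. This is a rough sketch, assuming the standard field order of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal); steal_percent is a hypothetical helper name.

```shell
# Print steal time as a percentage of all CPU time between two samples.
# Arguments: two aggregate "cpu ..." lines from /proc/stat (earlier, later).
steal_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2-9: user nice system idle iowait irq softirq steal.
            # guest and guest_nice (fields 10-11) are already part of user/nice.
            for (i = 2; i <= 9; i++) total += $i
            t[NR] = total
            s[NR] = $9
        }
        END {
            dt = t[2] - t[1]
            if (dt > 0) printf "%.1f\n", 100 * (s[2] - s[1]) / dt
            else print "n/a"
        }
    '
}

# Usage on a live system:
#   a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
#   steal_percent "$a" "$b"
```

A value that stays near zero suggests the hypervisor is not taking CPU time away from the machine, which makes the quota hypothesis more likely.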

To check context switches and the scheduler:

                    $ pidstat -w 1
                    $ runqlat (bcc)

To check NUMA:

                    $ numactl --show
                    $ numastat

Where might the "hidden limit" be?

cgroup (cpu.max / cpu.cfs_quota_us)
Kubernetes: resources.limits.cpu
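The two limits above are the same limit seen from two sides: a CPU limit in millicores is translated into a quota of CPU time per scheduling period. The sketch below shows that arithmetic, assuming the common default period of 100000 microseconds; the function name and the 700m value in the usage note are illustrative, not taken from the scenario.

```shell
# Convert a Kubernetes CPU limit in millicores into the "quota period"
# pair found in cgroup v2 cpu.max.
# $1: limit in millicores, $2: period in microseconds (assumed default 100000).
millicores_to_cpu_max() {
    millicores=$1
    period_us=${2:-100000}
    echo "$(( millicores * period_us / 1000 )) $period_us"
}

# Usage:
#   millicores_to_cpu_max 700    # prints "70000 100000"
```

With a hypothetical limit of 700m, a single-threaded process may run for 70 ms in every 100 ms period and is paused for the rest once the budget is spent. That would look like a process stuck around 70% CPU on a node that still has idle cores.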

Scenario 2 — "There's memory, but the system is slow"

Context

VM in the cloud
8 CPUs, 32 GB RAM
Database (active write)
SSD

Symptoms

CPU ~30–40%
Swap is almost unused
There is free memory (~5–8 GB)
but:

latency is increasing
transactions are slowing down
sometimes there are stops of hundreds of milliseconds

Metrics:

vmstat: wa → low
vmstat: si/so → almost 0
but: cat /proc/pressure/memory: some avg10=0.20

Questions

Why is the system slow even with "free memory"?
What does memory pressure mean?
What kernel mechanisms can block a process?
What should you check next?

Discovery

Why is the system slow even with "free memory"?

Free memory (~5–8 GB) does not guarantee that allocations are cheap. Most of the remaining RAM is typically held by reclaimable objects such as page cache, buffers, and slabs. When an allocation cannot be served from the free lists (for example, a larger contiguous allocation on fragmented memory, or free memory dropping below a zone watermark), the kernel has to reclaim or compact memory first, sometimes in the context of the allocating process. On a write-heavy database, flushing dirty pages to disk and the page faults that follow can add stalls of hundreds of milliseconds, so transactions slow down even though CPU usage is moderate and swap is barely used.

What does memory pressure mean?

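Memory pressure is the share of time during which tasks are stalled waiting for memory. It does not require the system to be out of free memory: a task stalls whenever the kernel must first reclaim pages (from cache, buffers, or slabs) or wait for dirty pages to be written back. In Linux, this is visible in `/proc/pressure/memory`. The averages are percentages, so `some avg10=0.20` means that at least one task was stalled on memory for about 0.20% of the last 10 seconds (roughly 20 ms), not 20%. The value is small but non-zero, which confirms that tasks do wait for memory despite the free RAM; the cumulative `total` counter (in microseconds) is better suited for catching short bursts.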
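PSI averages are percentages of wall-clock time, so `avg10=0.20` stands for 0.20% of the 10-second window. The conversion into stall time is simple enough to script. This is a small sketch, assuming the usual two-line `some` / `full` format of the PSI files; psi_some_avg10 is a made-up name.

```shell
# Print the "some avg10" value from PSI-formatted text on stdin, together
# with the approximate stall time it represents within the 10 s window.
psi_some_avg10() {
    awk '
        $1 == "some" {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "avg10")
                    # avg10 is a percentage of a 10000 ms window
                    printf "%.2f%% (~%.0f ms per 10 s)\n", kv[2], kv[2] / 100 * 10000
            }
        }
    '
}

# Usage on a live system:
#   psi_some_avg10 < /proc/pressure/memory
```

The same function works for /proc/pressure/io and /proc/pressure/cpu, since those files share the format.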

What kernel mechanisms can block a process?

Key mechanisms include:

Direct reclaim: when kswapd (the background reclaimer) cannot keep up, the allocating process has to free pages itself before its allocation is satisfied.
Page cache writeback: processes may wait while dirty pages are flushed to disk.
Locks and I/O wait: processes can block on the disk queue or on filesystem locks, especially with synchronous writes.
Pressure Stall Information (PSI) is not a blocking mechanism itself, but it reports how long processes wait due to memory, CPU, or I/O pressure.

What should you check next?

To investigate further:

Disk queue depth and latency using `iostat -x 1` to see whether the SSD is a bottleneck.
Memory stalls and reclaim activity via `/proc/pressure/memory` and `vmstat 1`.
Amount of dirty pages with `grep Dirty /proc/meminfo` to assess writeback impact.
Database-specific synchronous writes (fsync, commit) that could amplify latency.
Slab and kernel memory usage to identify potential kernel-side bottlenecks.
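For the dirty-page check it helps to read Dirty and Writeback side by side and in a friendlier unit. The following is a minimal sketch, assuming the standard /proc/meminfo layout with values in kB; the function name is illustrative.

```shell
# Print the Dirty and Writeback counters from meminfo-formatted text
# on stdin, converted from kB to MiB.
dirty_writeback_mib() {
    awk '
        $1 == "Dirty:"     { dirty = $2 }
        $1 == "Writeback:" { wb = $2 }
        END {
            printf "dirty=%.1f MiB writeback=%.1f MiB\n", dirty / 1024, wb / 1024
        }
    '
}

# Usage on a live system, sampled once per second:
#   while sleep 1; do dirty_writeback_mib < /proc/meminfo; done
```

A Dirty value that keeps growing while Writeback stays busy suggests that the workload produces dirty pages faster than they are flushed.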

Scenario 3 — "The database is killing the disk... but the disk is fast"

Context

PostgreSQL
NVMe SSD
16 CPUs, 64 GB RAM
High write load

Symptoms

CPU ~20–30%
iowait jumps to 15–20%
Query latency is unstable

Metrics:

                    iostat -x:

                        util: ~60%
                        await: jumps (1ms → 20ms)
                    /proc/meminfo:

                        Dirty: high
                        Writeback: increasing

Questions

Why can NVMe have such latencies?
What's happening with dirty pages?
How are page cache and fsync related?
Where is the bottleneck: disk or memory?

Discovery

Why can NVMe have such latencies?

NVMe itself is not inherently slow — typical latency is in tens to hundreds of microseconds. Observed spikes (1ms → 20ms) indicate not raw device latency, but queuing and system-level effects. In this scenario, latency is caused by:

Writeback pressure: the kernel cannot flush dirty pages fast enough
Queue buildup: bursts of writes create variable queue depth
fsync stalls: processes wait for durability guarantees
Write amplification: many small synchronous writes (e.g., WAL)

As a result, NVMe appears "slow", but the delay comes from the interaction between the workload (PostgreSQL) and the kernel writeback subsystem, not from the hardware itself.

What's happening with dirty pages?

Dirty pages are memory pages modified in the page cache but not yet written to disk. In this case:

High write workload → rapid accumulation of dirty pages
Writeback starts, but cannot keep up → Writeback increases
Dirty pages approach kernel limits (vm.dirty_ratio / vm.dirty_bytes)

When limits are reached, the kernel activates write throttling (balance_dirty_pages):

Application writes are slowed down
Processes are forced to wait for writeback

Dirty pages are not lost — they must eventually be written to disk. The system trades latency for data integrity and stability.
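How close the system is to write throttling can be estimated from /proc/vmstat, which exposes the current number of dirty pages together with the thresholds the kernel has computed. This is a sketch under the assumption that the fields nr_dirty, nr_dirty_background_threshold and nr_dirty_threshold are present (they are on current kernels); dirty_limit_usage is a hypothetical helper name.

```shell
# Print how full the dirty-page budget is, based on vmstat-formatted
# text on stdin. All three counters are in pages, so no unit conversion
# is needed for the ratios.
dirty_limit_usage() {
    awk '
        $1 == "nr_dirty"                      { dirty = $2 }
        $1 == "nr_dirty_background_threshold" { bg = $2 }
        $1 == "nr_dirty_threshold"            { limit = $2 }
        END {
            if (limit > 0)
                printf "dirty: %.1f%% of limit, background threshold: %.1f%% of limit\n",
                       100 * dirty / limit, 100 * bg / limit
            else
                print "n/a"
        }
    '
}

# Usage on a live system:
#   dirty_limit_usage < /proc/vmstat
```

Once the dirty share rises above the background threshold, flusher threads start writing pages out. Throttling of the writing processes can begin before the limit itself is reached, so values well below 100% may already coincide with latency spikes.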

How are page cache and fsync related?

PostgreSQL writes data to the page cache first, not directly to disk. The flow is:

Write → data enters page cache (becomes dirty)
fsync() → forces the kernel to flush relevant dirty pages to disk
Process waits until data is physically persisted

If writeback is lagging:

fsync latency increases
Query latency becomes unstable

Thus, fsync couples application latency to the state of the kernel writeback subsystem.

Where is the bottleneck: disk or memory?

The bottleneck is neither purely the disk nor purely memory, but their interaction. More precisely:

The disk is not saturated (util ~60%), so raw throughput is not the limit
Memory (page cache) is accumulating dirty data faster than it can be flushed
The kernel writeback mechanism becomes the limiting factor

This leads to:

Write throttling
Latency spikes
Increased iowait despite moderate disk utilization

Therefore, the bottleneck is the writeback pipeline (memory → disk), not a single component in isolation.
