Introduction
Understanding process management and performance monitoring is essential for maintaining healthy systems. When applications slow down or servers become unresponsive, you need tools and techniques to diagnose issues quickly. This guide covers practical approaches to monitoring, troubleshooting, and optimizing Linux system performance.
Process Monitoring
Basic Commands
# List all processes
ps aux
# Process tree (parent-child relationships)
ps auxf
pstree -p
# Real-time process monitoring
top
htop # More user-friendly alternative
# Filter by user
ps -u username
# Filter by process name
ps aux | grep nginx
pgrep nginx
Understanding top/htop Output
top - 14:32:01 up 45 days, 2:15, 3 users, load average: 0.52, 0.58, 0.59
Tasks: 256 total, 1 running, 255 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.3 us, 1.2 sy, 0.0 ni, 93.1 id, 0.2 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 15842.3 total, 712.3 free, 7256.4 used, 7873.6 buff/cache
MiB Swap: 2048.0 total, 2040.0 free, 8.0 used. 8114.1 avail Mem
Key metrics:
- load average: average number of runnable (and, on Linux, uninterruptible) tasks over the last 1/5/15 minutes. Compare to the number of CPUs.
- us: User space CPU time
- sy: System/kernel CPU time
- id: Idle CPU time
- wa: Waiting for I/O
- buff/cache: Memory used for caching (available if needed)
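The load-vs-CPU comparison above is easy to script. A minimal sketch reading /proc/loadavg directly (the "possibly saturated" wording is just an illustration, not standard output):

```shell
# Compare the 1-minute load average against the CPU count.
load=$(cut -d' ' -f1 /proc/loadavg)   # 1-minute load average
cpus=$(nproc)                          # number of online CPUs
awk -v l="$load" -v c="$cpus" \
  'BEGIN { printf "load %.2f across %d CPUs: %s\n", l, c, (l > c) ? "possibly saturated" : "ok" }'
```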
htop Shortcuts
| Key | Action |
|---|---|
| F6 | Sort by column |
| F9 | Kill process |
| F4 | Filter by name |
| t | Tree view |
| H | Hide user threads |
| Space | Tag process |
Memory Management
Memory Usage
# Memory summary
free -h
# Detailed memory info
cat /proc/meminfo
# Memory by process
ps aux --sort=-%mem | head -20
# Detailed memory map for process
pmap <pid>
Understanding Memory
total used free shared buff/cache available
Mem: 15Gi 7.1Gi 696Mi 416Mi 7.7Gi 7.9Gi
Swap: 2.0Gi 8.0Mi 2.0Gi
- available > free because buff/cache can be reclaimed
- High buff/cache is normal and good (caching disk reads)
- Watch available memory, not free
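To watch the right number directly, MemAvailable in /proc/meminfo is the kernel's own estimate of memory claimable without swapping (present since kernel 3.14). A quick sketch:

```shell
# MemAvailable: the kernel's estimate of memory available to new allocations.
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
awk -v a="$avail_kb" -v t="$total_kb" \
  'BEGIN { printf "available: %d MiB (%.1f%% of total)\n", a/1024, 100*a/t }'
```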
Finding Memory Leaks
# Monitor process memory over time
while true; do
  date
  ps -o pid,vsz,rss,comm -p "$(pgrep myapp)"
  sleep 60
done >> /var/log/memory-monitor.log
# Check if process memory keeps growing
watch -n 5 'ps -o pid,rss,command -p $(pgrep myapp)'
CPU Management
CPU Usage by Process
# Top CPU consumers
ps aux --sort=-%cpu | head -20
# CPU usage per core
mpstat -P ALL 1
# Real-time CPU monitoring
htop
# Press F2 → Display options → Check "Detailed CPU time"
Identifying CPU Bottlenecks
# Check for CPU-bound processes
top
# Look for processes with high %CPU
# Check for I/O wait
vmstat 1
# High wa (wait) indicates an I/O bottleneck, not CPU
# Process strace for debugging
strace -c -p <pid> # System call statistics
Disk I/O Monitoring
Real-time I/O
# I/O by process
iotop
iotop -oP # Only processes doing I/O
# Disk activity
iostat -xz 1
# I/O wait time
vmstat 1
# Watch the "wa" column
Understanding iostat
iostat -xz 1
Device r/s w/s rMB/s wMB/s %util
sda 45.00 120.00 1.80 48.00 85.00
- %util > 80%: Disk is becoming a bottleneck
- r/s, w/s: Reads and writes per second
- await: Average I/O wait time in ms (shown as r_await/w_await in newer sysstat); high values = problem
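The counters iostat samples come from /proc/diskstats. A hedged peek at the raw reads/writes-completed fields (the device-name pattern is an assumption; adjust it for your disks):

```shell
# Fields 4 and 8 of /proc/diskstats are reads and writes completed since boot.
awk '$3 ~ /^(sd[a-z]+|vd[a-z]+|nvme[0-9]+n[0-9]+)$/ {
    printf "%-10s reads=%s writes=%s\n", $3, $4, $8
}' /proc/diskstats
```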
Find Large Files
# Find files over 1GB
find /var -type f -size +1G
# Disk usage by directory
du -sh /*
du -sh /var/*
# Interactive disk usage
ncdu /var
Process Control
Signals
# Graceful termination (SIGTERM)
kill <pid>
kill -15 <pid>
# Force kill (SIGKILL)
kill -9 <pid>
# Kill by name
pkill nginx
killall nginx
# Send HUP (reload config)
kill -HUP <pid>
# Common signals:
# SIGTERM (15): Graceful shutdown
# SIGKILL (9): Force kill (cannot be caught)
# SIGHUP (1): Hangup/reload
# SIGSTOP (19): Pause process
# SIGCONT (18): Resume process
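The practical difference between SIGTERM and SIGKILL is that only the former can be trapped for cleanup. A small self-contained demonstration (the "cleaning up" message is made up for illustration):

```shell
# A trapped SIGTERM runs the cleanup handler; SIGKILL would bypass any handler.
out=$(sh -c 'trap "echo cleaning up; exit 0" TERM
             kill -TERM $$    # signal ourselves: the trap fires
             sleep 1          # never reached: the trap exits first')
echo "$out"
```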
Process Priority
# Start with lower priority (nicer)
nice -n 10 ./long-running-script.sh
# Change running process priority
renice -n 10 -p <pid>
# Priority range: -20 (highest) to 19 (lowest)
# Only root can set negative nice values
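The effect is easy to see with nice itself, which prints the current niceness when run without arguments. A quick sketch:

```shell
nice                 # niceness of the current shell (typically 0)
nice -n 10 nice      # the child runs 10 steps nicer and prints its value
```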
Background Processes
# Run in background
./script.sh &
# Move current process to background
Ctrl+Z # Suspend
bg # Resume in background
# Bring to foreground
fg
# List background jobs
jobs
# Keep running after logout
nohup ./script.sh &
# Output goes to nohup.out
# Or use disown
./script.sh &
disown %1
System Resource Limits
View Limits
# Current shell limits
ulimit -a
# Process limits
cat /proc/<pid>/limits
Set Limits
# Temporary (current session)
ulimit -n 65535 # Max open files
# Permanent: /etc/security/limits.conf
# <user> <type> <item> <value>
www-data soft nofile 65535
www-data hard nofile 65535
* soft nproc 4096
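Soft limits can be adjusted without root as long as they stay at or below the hard limit. A sketch using the current shell (the value 1024 is arbitrary):

```shell
hard=$(ulimit -Hn)     # hard ceiling (only root can raise it)
ulimit -Sn 1024        # lower the soft limit for this shell only
echo "soft=$(ulimit -Sn) hard=$hard"
```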
Monitoring Tools
vmstat - Virtual Memory Statistics
vmstat 1
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
# r b swpd free buff cache si so bi bo in cs us sy id wa st
# 1 0 0 712340 125612 7935460 0 0 1 45 102 234 5 1 93 1 0
- r: Runnable processes, running or waiting for CPU (consistently > CPU count = overload)
- b: Blocked processes (waiting for I/O)
- si/so: Swap in/out (should be 0)
- bi/bo: Block I/O
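The r and b columns come straight from counters in /proc/stat, so reading them directly is a quick sanity check. A sketch:

```shell
# procs_running feeds vmstat's "r"; procs_blocked feeds "b".
running=$(awk '/^procs_running/ {print $2}' /proc/stat)
blocked=$(awk '/^procs_blocked/ {print $2}' /proc/stat)
echo "runnable=$running blocked=$blocked cpus=$(nproc)"
```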
sar - System Activity Reporter
# Install sysstat for sar
sudo apt install sysstat
# CPU history
sar -u
# Memory history
sar -r
# Disk I/O history
sar -d
# Network history
sar -n DEV
dstat - Comprehensive Statistics
# Install
sudo apt install dstat
# All-in-one monitoring
dstat -cdngy
# CPU, disk, network, paging, system
Troubleshooting Workflows
High CPU Usage
# 1. Identify the process
top -c
# Note PID of high-CPU process
# 2. Check what it's doing
strace -p <pid> 2>&1 | head -50
# 3. Check if it's using all cores
mpstat -P ALL 1
# 4. Profile if needed (for your own apps)
perf top -p <pid>
High Memory Usage
# 1. Check overall memory
free -h
# 2. Find memory hogs
ps aux --sort=-%mem | head -10
# 3. Check for memory leaks
pmap <pid> | tail -1
# 4. Clear caches (if necessary, usually not)
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
# (a plain "echo 3 >" fails without a root shell: sudo doesn't apply to the redirect)
System Unresponsive
# 1. Check load average
uptime
# 2. Check for I/O wait
vmstat 1
# 3. Check for out of memory
dmesg | grep -i "out of memory" | tail -20
# 4. Check swap usage
free -h
swapon --show
# 5. Check disk space
df -h
Logging and Persistence
Make journald Logs Persistent
sudo mkdir -p /var/log/journal
sudo systemd-tmpfiles --create --prefix /var/log/journal
sudo systemctl restart systemd-journald
View Past Boot Logs
# List boots
journalctl --list-boots
# View specific boot
journalctl -b -1 # Previous boot
journalctl -b -2 # Two boots ago
# View warnings and errors
journalctl -p warning
journalctl -b -1 -p err
The Senior Troubleshooting Mindset: First 60 Seconds
When you SSH into a burning server, run these commands in order:
# 1. Load averages - is it increasing or decreasing?
uptime
# 2. Kernel errors - OOM kills, disk I/O errors?
dmesg | tail
# 3. System-wide view - processes, memory, swap, CPU
vmstat 1
# 4. CPU balance across cores - is one core pegged?
mpstat -P ALL 1
# 5. Which process is causing the load?
pidstat 1
# 6. Disk latency and saturation
iostat -xz 1
# 7. Memory usage and cache
free -m
# 8. Network throughput
sar -n DEV 1
# 9. TCP connection failures/retransmits
sar -n TCP,ETCP 1
# 10. The classic overview
top
Quick Diagnostic: High CPU But Low User Time?
If CPU is high but us (user space) is low:
- High sy (system): too many syscalls or context switches. Check for too many threads/processes.
- High wa (wait): disk I/O bottleneck, not CPU.
- High in (interrupts): network card or hardware interrupts.
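Context-switch volume (a common cause of high sy) can be estimated from the kernel's ctxt counter. A rough sketch sampling one second apart:

```shell
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)   # context switches since boot
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/sec: $((c2 - c1))"
# Tens of thousands per core per second is worth investigating.
```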
File Descriptor Emergency
"Too many open files" (EMFILE) crashes production. Quick check and fix:
# Check current limits
ulimit -Sn # Soft limit (can be raised)
ulimit -Hn # Hard limit (ceiling)
# For a specific process
cat /proc/<pid>/limits | grep "open files"
# Count current open files
ls /proc/<pid>/fd | wc -l
Permanent fix in /etc/security/limits.conf:
* soft nofile 200000
* hard nofile 500000
Disk I/O Saturation Check
If iostat -xz 1 shows %util > 80%, the disk is saturated.
Quick questions:
1. Who is writing? → iotop
2. Is it random or sequential? → High IOPS with low throughput = random
3. Is the disk failing? → Check dmesg | tail for errors
Quick disk benchmark:
# Write speed with the page cache neutralized (oflag=dsync syncs every block,
# giving a pessimistic sustained figure; conv=fdatasync instead syncs once at the end)
dd if=/dev/zero of=testfile bs=1M count=1024 oflag=dsync
Conclusion
Effective process and performance management requires understanding system metrics and having the right tools ready. Use htop for interactive monitoring, vmstat and iostat for identifying bottlenecks, and know your signals for process control. Senior engineers isolate resources systematically (CPU, RAM, Disk, Network) rather than guessing randomly. Regular monitoring helps catch issues before they become critical.