
Introduction

Understanding process management and performance monitoring is essential for maintaining healthy systems. When applications slow down or servers become unresponsive, you need tools and techniques to diagnose issues quickly. This guide covers practical approaches to monitoring, troubleshooting, and optimizing Linux system performance.

Process Monitoring

Basic Commands

# List all processes
ps aux
# Process tree (parent-child relationships)
ps auxf
pstree -p
# Real-time process monitoring
top
htop # More user-friendly alternative
# Filter by user
ps -u username
# Filter by process name
ps aux | grep nginx
pgrep nginx

Understanding top/htop Output

top - 14:32:01 up 45 days, 2:15, 3 users, load average: 0.52, 0.58, 0.59
Tasks: 256 total, 1 running, 255 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.3 us, 1.2 sy, 0.0 ni, 93.1 id, 0.2 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 15842.3 total, 712.3 free, 7256.4 used, 7873.6 buff/cache
MiB Swap: 2048.0 total, 2040.0 free, 8.0 used. 8114.1 avail Mem

Key metrics:

  • load average: 1-, 5-, and 15-minute averages of runnable (and uninterruptibly blocked) tasks. Compare to the number of CPUs.
  • us: User space CPU time
  • sy: System/kernel CPU time
  • id: Idle CPU time
  • wa: Waiting for I/O
  • buff/cache: Memory used for caching (available if needed)
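The "compare load to CPU count" rule can be scripted. A minimal sketch, assuming a Linux /proc filesystem; the function name and messages are illustrative:

```shell
# Sketch: flag when the 1-minute load average exceeds the CPU count.
check_load() {
    load1=$(cut -d' ' -f1 /proc/loadavg)   # 1-minute load average
    cpus=$(nproc)                          # number of online CPUs
    # awk handles the floating-point comparison a plain shell test cannot
    if awk -v l="$load1" -v c="$cpus" 'BEGIN { exit !(l > c) }'; then
        echo "WARN: load $load1 exceeds $cpus CPUs"
    else
        echo "OK: load $load1 within $cpus CPUs"
    fi
}
check_load
```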

htop Shortcuts

Key     Action
F6      Sort by column
F9      Kill process
F4      Filter by name
t       Tree view
H       Hide user threads
Space   Tag process

Memory Management

Memory Usage

# Memory summary
free -h
# Detailed memory info
cat /proc/meminfo
# Memory by process
ps aux --sort=-%mem | head -20
# Detailed memory map for process
pmap <pid>

Understanding Memory

        total   used    free    shared  buff/cache  available
Mem:    15Gi    7.1Gi   696Mi   416Mi   7.7Gi       7.9Gi
Swap:   2.0Gi   8.0Mi   2.0Gi
  • available > free because buff/cache can be reclaimed
  • High buff/cache is normal and good (caching disk reads)
  • Watch available memory, not free
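To act on "watch available, not free", you can read MemAvailable directly. A short sketch; mem_available_pct is our own name, not a standard tool, and it assumes a Linux /proc/meminfo:

```shell
# Sketch: report available memory as a percentage of total, using the
# MemAvailable field that the "available" column of free is based on.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { printf "%.0f\n", avail * 100 / total }' /proc/meminfo
}
echo "Available memory: $(mem_available_pct)%"
```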

Finding Memory Leaks

# Monitor process memory over time
while true; do
    ps -o pid,vsz,rss,comm -p $(pgrep myapp)
    sleep 60
done >> /var/log/memory-monitor.log
# Check if process memory keeps growing
watch -n 5 'ps -o pid,rss,command -p $(pgrep myapp)'
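The same idea can be wrapped in a reusable check. A rough sketch: rss_of and watch_growth are hypothetical helper names, and a single growth step proves nothing by itself; only a steady climb over many intervals suggests a leak.

```shell
# Sketch: compare a process's RSS across two samples.
rss_of() { ps -o rss= -p "$1" | tr -d ' '; }   # resident set size in kB

watch_growth() {
    pid=$1; interval=${2:-60}                  # 60s default is arbitrary
    before=$(rss_of "$pid")
    sleep "$interval"
    after=$(rss_of "$pid")
    if [ "$after" -gt "$before" ]; then
        echo "RSS grew: ${before}kB -> ${after}kB"
    else
        echo "RSS stable: ${after}kB"
    fi
}
watch_growth $$ 1   # demo against the current shell, 1-second interval
```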

CPU Management

CPU Usage by Process

# Top CPU consumers
ps aux --sort=-%cpu | head -20
# CPU usage per core
mpstat -P ALL 1
# Real-time CPU monitoring
htop
# Press F2 → Display options → Check "Detailed CPU time"

Identifying CPU Bottlenecks

# Check for CPU-bound processes
top
# Look for processes with high %CPU
# Check for I/O wait
vmstat 1
# High wa (wait) indicates an I/O bottleneck, not CPU
# Trace system calls for debugging
strace -c -p <pid> # System call summary (counts and times); detach with Ctrl+C

Disk I/O Monitoring

Real-time I/O

# I/O by process
iotop
iotop -oP # Only processes doing I/O
# Disk activity
iostat -xz 1
# I/O wait time
vmstat 1
# Watch the "wa" column

Understanding iostat

iostat -xz 1
Device   r/s     w/s      rMB/s   wMB/s   %util
sda      45.00   120.00   1.80    48.00   85.00
  • %util > 80%: Disk is becoming a bottleneck
  • r/s, w/s: Reads and writes per second
  • await (r_await/w_await in newer iostat versions): Average wait time in ms; high values indicate a problem
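A quick awk filter can surface only the saturated devices from output shaped like the table above. A sketch; the 80% threshold and the sample lines are illustrative:

```shell
# Sketch: print devices whose last column (%util) exceeds a threshold.
flag_saturated() {
    awk -v limit=80 'NR > 1 && $NF + 0 > limit { print $1, $NF "%" }'
}
# Demo with sample data in the same column layout as iostat -xz:
printf 'Device r/s w/s rMB/s wMB/s %%util\nsda 45.00 120.00 1.80 48.00 85.00\nsdb 5.00 2.00 0.10 0.05 3.00\n' | flag_saturated
```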

Find Large Files

# Find files over 1GB
find /var -type f -size +1G
# Disk usage by directory
du -sh /*
du -sh /var/*
# Interactive disk usage
ncdu /var

Process Control

Signals

# Graceful termination (SIGTERM)
kill <pid>
kill -15 <pid>
# Force kill (SIGKILL)
kill -9 <pid>
# Kill by name
pkill nginx
killall nginx
# Send HUP (reload config)
kill -HUP <pid>
# Common signals:
# SIGTERM (15): Graceful shutdown
# SIGKILL (9): Force kill (cannot be caught)
# SIGHUP (1): Hangup/reload
# SIGSTOP (19): Pause process
# SIGCONT (18): Resume process
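The difference between catchable and uncatchable signals is easy to demonstrate. A toy sketch in which a background worker traps SIGTERM and writes a marker file from its handler before exiting; SIGKILL would skip the handler entirely:

```shell
# Sketch: a worker that cleans up on SIGTERM.
tmpfile=$(mktemp)
( trap 'kill "$spid" 2>/dev/null; echo "cleaned up" > "$tmpfile"; exit 0' TERM
  sleep 30 & spid=$!
  wait ) &                 # wait is interruptible, so the trap runs promptly
worker=$!
sleep 1                    # give the subshell time to install the trap
kill -TERM "$worker"       # graceful: the handler runs
wait "$worker" 2>/dev/null
cat "$tmpfile"             # -> cleaned up
```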

Process Priority

# Start with lower priority (nicer)
nice -n 10 ./long-running-script.sh
# Change running process priority
renice -n 10 -p <pid>
# Priority range: -20 (highest) to 19 (lowest)
# Only root can set negative nice values

Background Processes

# Run in background
./script.sh &
# Move current process to background
Ctrl+Z # Suspend
bg # Resume in background
# Bring to foreground
fg
# List background jobs
jobs
# Keep running after logout
nohup ./script.sh &
# Output goes to nohup.out
# Or use disown
./script.sh &
disown %1

System Resource Limits

View Limits

# Current shell limits
ulimit -a
# Process limits
cat /proc/<pid>/limits

Set Limits

# Temporary (current session)
ulimit -n 65535 # Max open files
# Permanent: /etc/security/limits.conf
# <user> <type> <item> <value>
www-data soft nofile 65535
www-data hard nofile 65535
* soft nproc 4096

Monitoring Tools

vmstat - Virtual Memory Statistics

vmstat 1
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
# r b swpd free buff cache si so bi bo in cs us sy id wa st
# 1 0 0 712340 125612 7935460 0 0 1 45 102 234 5 1 93 1 0
  • r: Runnable processes (consistently above the CPU count = overload)
  • b: Blocked processes (waiting for I/O)
  • si/so: Swap in/out (sustained nonzero values mean memory pressure)
  • bi/bo: Blocks received from / sent to a block device

sar - System Activity Reporter

# Install sysstat for sar
sudo apt install sysstat
# CPU history
sar -u
# Memory history
sar -r
# Disk I/O history
sar -d
# Network history
sar -n DEV

dstat - Comprehensive Statistics

# Install
sudo apt install dstat
# All-in-one monitoring
dstat -cdngy
# CPU, disk, network, paging, system

Troubleshooting Workflows

High CPU Usage

# 1. Identify the process
top -c
# Note PID of high-CPU process
# 2. Check what it's doing
strace -p <pid> 2>&1 | head -50
# 3. Check if it's using all cores
mpstat -P ALL 1
# 4. Profile if needed (for your own apps)
perf top -p <pid>

High Memory Usage

# 1. Check overall memory
free -h
# 2. Find memory hogs
ps aux --sort=-%mem | head -10
# 3. Check for memory leaks
pmap <pid> | tail -1
# 4. Clear caches (rarely necessary; requires root)
sync; echo 3 > /proc/sys/vm/drop_caches
# With sudo: sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

System Unresponsive

# 1. Check load average
uptime
# 2. Check for I/O wait
vmstat 1
# 3. Check for out of memory
dmesg | tail -50 | grep -i "out of memory"
# 4. Check swap usage
free -h
swapon --show
# 5. Check disk space
df -h

Logging and Persistence

Make journald Logs Persistent

sudo mkdir -p /var/log/journal
sudo systemd-tmpfiles --create --prefix /var/log/journal
sudo systemctl restart systemd-journald

View Past Boot Logs

# List boots
journalctl --list-boots
# View specific boot
journalctl -b -1 # Previous boot
journalctl -b -2 # Two boots ago
# View warnings and errors
journalctl -p warning
journalctl -b -1 -p err

The Senior Troubleshooting Mindset: First 60 Seconds

When you SSH into a burning server, run these commands in order:

# 1. Load averages - is it increasing or decreasing?
uptime
# 2. Kernel errors - OOM kills, disk I/O errors?
dmesg | tail
# 3. System-wide view - processes, memory, swap, CPU
vmstat 1
# 4. CPU balance across cores - is one core pegged?
mpstat -P ALL 1
# 5. Which process is causing the load?
pidstat 1
# 6. Disk latency and saturation
iostat -xz 1
# 7. Memory usage and cache
free -m
# 8. Network throughput
sar -n DEV 1
# 9. TCP connection failures/retransmits
sar -n TCP,ETCP 1
# 10. The classic overview
top

Quick Diagnostic: High CPU But Low User Time?

If CPU is high but us (user space) is low:

  • High sy (system): Too many context switches. Check for too many threads/processes.
  • High wa (wait): Disk I/O bottleneck, not CPU.
  • High in (interrupts): Network card or hardware interrupts.
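To confirm a high-sy diagnosis, you can measure the context-switch rate directly from /proc/stat instead of eyeballing vmstat's cs column. A Linux-specific sketch; the one-second window is illustrative:

```shell
# Sketch: system-wide context switches per second from the "ctxt"
# counter in /proc/stat (a monotonically increasing total).
ctxt() { awk '/^ctxt / { print $2 }' /proc/stat; }
before=$(ctxt)
sleep 1
after=$(ctxt)
echo "context switches/sec: $((after - before))"
```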

File Descriptor Emergency

"Too many open files" (EMFILE) crashes production. Quick check and fix:

# Check current limits
ulimit -Sn # Soft limit (can be raised)
ulimit -Hn # Hard limit (ceiling)
# For a specific process
cat /proc/<pid>/limits | grep "open files"
# Count current open files
ls /proc/<pid>/fd | wc -l

Permanent fix in /etc/security/limits.conf:

* soft nofile 200000
* hard nofile 500000

Disk I/O Saturation Check

If iostat -xz 1 shows %util > 80%, the disk is saturated.

Quick questions:

  1. Who is writing? → iotop
  2. Is it random or sequential? → High IOPS with low throughput = random
  3. Is the disk failing? → Check dmesg | tail for errors

Quick disk benchmark:

# Write speed (bypass cache)
dd if=/dev/zero of=testfile bs=1M count=1024 oflag=dsync

Conclusion

Effective process and performance management requires understanding system metrics and having the right tools ready. Use htop for interactive monitoring, vmstat and iostat for identifying bottlenecks, and know your signals for process control. Senior engineers isolate resources systematically (CPU, RAM, Disk, Network) rather than guessing randomly. Regular monitoring helps catch issues before they become critical.

 
 
 