I was recently asked how to deal with performance issues on Linux. Out of context, it’s a little like "how long is a piece of string," because what I’d look for first would depend on the situation. There are generally two kinds of cases:
The system in front of me just got really slow, in the last few seconds.
Somebody tried to load a website, and it took forever to get a page back.
In the first case, there’s a good chance that something I just did is the cause of the slowdown. Did I just load a webpage in Firefox that brought up the Flash plugin? (That isn’t as much of an issue now that there are easy options to make Flash only load and run when you click on it, but it used to be a very common cause.) Did I just run something that started trying to analyze every file in my system? That could take a while. Sometimes the proximate cause is pretty obvious, when you’re sitting in front of the computer as it begins to slow down. The thing that was most recently done is pretty likely to be the culprit.
But not always. There’s a lot that happens behind the scenes of the typical user experience, on all operating systems. Programs can be scheduled to run based on what time it is, or sit around doing nothing until something arrives from the network - like a printer queue, waiting for a print request. Activity can have a wide variety of causes.
But to narrow it down, the first thing I’ll do is open a terminal and run
ps aux. This means, "Show me information about all the processes running in the system right now." That includes how long it’s been running, how much of the CPU and memory each uses, what state the program’s in, and so forth. There’s a good chance the thing that’s using a lot of memory and CPU is near the bottom of the list. If it isn’t, then I turn to
htop. If it’s installed, it’s a major improvement over
top. I wish it were in the default install of every major Linux system.
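Rather than scanning the whole list by eye, you can have ps sort it for you. A quick sketch, assuming the procps version of ps (the one on virtually every Linux distribution):

```shell
# All processes, biggest CPU users first (procps ps)
ps aux --sort=-%cpu | head -n 5

# Same idea, but sorted by resident memory instead
ps aux --sort=-%mem | head -n 5
```

In htop you get the same thing interactively: press P to sort by CPU, M to sort by memory, or F6 to pick any column.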
Sometimes it’s obvious, and you can tell the offending process to knock it off directly from htop. But if the process isn’t owned by the regular user, you might not have permission to kill it unless you ran it with sudo to begin with. Oops. In that case, it’s easy enough to remember the process name, exit htop, run
pgrep pattern (where pattern is the name of the process, to make sure you only have one process in the cross-hairs), then press Up, Ctrl+A (go back to the beginning of the line), Alt+D (delete the following word, the 'pgrep'), type 'pkill', and hit Return.
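Here’s the whole dance on a harmless stand-in target (a background sleep, so nothing of value gets killed):

```shell
# A throwaway process to practice on
sleep 300 &

# First, confirm exactly what the pattern matches (-a shows the full command)
pgrep -a sleep

# Same pattern, now destructive -- only after pgrep showed the right target
pkill sleep
```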
Why not just type
pkill firefox (or whatever) instead of figuring out how to use shell history?
First, you’ll type less in the long run and save your hands. But more importantly, a key habit to get into is to be careful with those destructive commands, and don’t trust yourself too much. Design habits that will make it harder to screw up. Like this:
For example, anything that can follow
ls (for listing files) can also follow
rm (for removing files). If you want to delete something, always use
ls first, like so:
ls 1[0-9]. Then press Up, Ctrl+A, Alt+D, and type
rm. This way, you can be certain that you’re referring to exactly the same thing when you run
rm as when you run
ls. Get into this habit before you do what I once did and accidentally blow away
/usr instead of
./usr. Don’t retype important things. Any valid patterns (regexes in this case) that can follow pgrep can also follow pkill.
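The habit in action, in a scratch directory (the file names here are just for illustration):

```shell
# Scratch directory with some throwaway files
cd "$(mktemp -d)"
touch 10 11 12 notes.txt

# Preview: see exactly what the glob matches before doing anything destructive
ls 1[0-9]       # 10  11  12 -- notes.txt is not matched

# Up, Ctrl+A, Alt+D, type rm: the same glob, so exactly the same files
rm 1[0-9]
ls              # notes.txt
```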
If you have no idea what the program is doing, it’s hard to know what the consequences will be of just killing it off. It’s potentially dangerous to do that: you could lose data or leave things in an inconsistent state. If it’s a system service, try just telling the system to stop it (for example,
systemctl stop dbus) instead of killing it outright. This is much safer, especially if the issue is with a database: ACID guarantees are designed to survive a crash, but recovery after an unclean shutdown can still be slow and painful. Killing a process is a last resort.
If the system is completely unresponsive, it may be paging (informal synonym: swapping). This happens when programs try to use more memory than physically exists, and the system falls back to using the disk as "backup memory," which is ridiculously slow. On the one hand, this can be really bad, because it can drag everything down until the system is nearly indistinguishable from frozen.
On the other hand if the system is paging, then things may be merely almost frozen (not actually stuck) and work might eventually get done if you wait long enough. But if you’re paging with a slow spinning hard drive that’s trying to act like fast memory, you might be waiting a loooong time.
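To check whether paging is what’s happening, look at swap usage. A minimal sketch using free (part of procps) and the kernel’s own numbers in /proc/meminfo:

```shell
# Human-readable memory and swap usage; a nearly full Mem: line plus a
# heavily used Swap: line is the classic paging signature
free -h

# The raw numbers behind it, straight from the kernel
grep -E 'SwapTotal|SwapFree' /proc/meminfo
```

vmstat 1 (also from procps) adds the si/so columns: sustained nonzero swap-in and swap-out rates mean the system is actively paging right now, not just holding old pages in swap.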
So the first thing to do is to figure out why the process is slow. After all, if httpd is the "heaviest" process on your system, it really won’t do any good to just kill it when it gets too big. That won’t solve the problem - there will be new httpd processes being created as new requests for web pages are made. Tools like
ps and to a slightly lesser degree
htop show you the instantaneous state of the system, but not what that state has been in the time leading up to the inquiry.
To get a sense of how busy the system is, the term to remember when you need to look it up is "load average," which is reported by the
uptime command or at the top right of the display in
htop (or from a wide variety of other tools) all of which will read it from /proc/loadavg. There are three numbers in the output of
uptime, giving the load average over the last 1, 5, and 15 minutes respectively.
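For example (the numbers shown in the comments below are illustrative; yours will differ):

```shell
uptime
#  14:02:11 up 6 days,  3:41,  2 users,  load average: 0.52, 0.58, 0.59

# The same three numbers, straight from the kernel; the two extra fields are
# runnable/total scheduling entities and the most recently created PID
cat /proc/loadavg
# 0.52 0.58 0.59 1/823 31337

# Rule of thumb: sustained load above the CPU core count means work is queueing
nproc
```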
As the manpage for /proc/loadavg puts it, it’s "the number of jobs in the run queue (state R) or waiting for disk I/O (state D)". Describing the different states that a process may be in is complicated enough that it would be mostly a digression, but I’ll explain what this quote from the manpage describes. The run queue is the list of processes that aren’t waiting for something else to happen before they can be run. What would keep a process from being in the run queue?
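You can see exactly which processes are in those two states right now; a sketch using ps’s stat field:

```shell
# Keep the header line, plus any process whose state starts with
# R (runnable) or D (uninterruptible sleep, usually waiting on disk I/O)
ps -eo pid,stat,comm | awk 'NR == 1 || $2 ~ /^[RD]/'
```

The ps process itself will usually show up in the output, since it is necessarily running (state R) at the moment it takes its own snapshot.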
Other Unix-like OSes besides Linux don’t include waiting for disk I/O among reasons to include a process in the calculation of the load average, which surprises me somewhat. There’s a subtle difference between waiting to run because something else is running instead, and waiting to run because nothing useful can be done until something specific has already happened. Normally, Linux makes processes take turns on each CPU core (very quick turns) and in a perfect system, nothing would ever be waiting for something specific to happen before it could be run. The only factor determining what order to run things in would be any assigned priorities.
Of course, this all relies on programs being well-behaved. Nobody wants someone in the back seat taking up all the driver’s focus by continually asking, "Are we there yet? Are we there yet?" It’s very easy to accidentally write the software equivalent of that, and always better for it to ask, "Can you wake me up when we’re there?" and then snooze a while.
Remember, CPUs are fast. A lot of processes are just sitting around snoozing: they’re waiting for something specific to happen, not sitting on the run queue waiting for their turn on a CPU. Modern CPUs even power down when there’s simply nothing else to do, which is good for saving energy. As I write this in focuswriter, I have a few tabs open in Firefox, a couple of PDFs open, a few terminal windows, and my CPUs are spending about 97% of the time in the lowest power mode available that couldn’t be described as "off." How do I know? With powertop. It gives you lots of information about power consumption, and convenient ways to reduce it.
I just spent a few hours reading through the long list of available tools for making more detailed measurements of I/O, CPU usage, disk and network performance, and related software. There’s a lot of tools out there. I’ll describe how to use and interpret the data from a few more of them in future articles. I’ll also look more at what sorts of changes can be made to improve performance.
Until then, borrowing the words of Avdi Grimm, happy hacking!