Server Load – Basic Troubleshooting Tips

In this tutorial I will be covering the basics of troubleshooting server load and slow server issues.

For this tutorial you will need to be logged in to your server as root over SSH. You can find steps for accessing the system console here.
You can use a Linux/Mac OS X terminal, or PuTTY on Windows.
(Please note: if you’re not comfortable using a command line environment, this may not be for you.)
(Side note: this is an “at your own risk” guide. If you mess up, we are not responsible. Make sure you know what you are doing before you do it.)

So first off, you’re probably wondering, “What exactly is server load?”

A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator … sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they’re in for delays.
So, Bridge Operator, what numbering system are you going to use? How about:

  • 0.00 means there’s no traffic on the bridge at all. In fact, between 0.00 and 1.00 means there’s no backup, and an arriving car will just go right on.
  • 1.00 means the bridge is exactly at capacity. All is still good, but if traffic gets a little heavier, things are going to slow down.
  • Over 1.00 means there’s backup. How much? Well, 2.00 means that there are two lanes’ worth of cars total — one lane’s worth on the bridge, and one lane’s worth waiting. 3.00 means there are three lanes’ worth total — one lane’s worth on the bridge, and two lanes’ worth waiting. Etc.

This is basically what CPU load is. “Cars” are processes using a slice of CPU time (“crossing the bridge”) or queued up to use the CPU. Unix refers to this as the run-queue length: the sum of the number of processes that are currently running plus the number that are waiting (queued) to run.

Like the bridge operator, you’d like your cars/processes to never be waiting. So, your CPU load should ideally stay below 1.00. Also like the bridge operator, you are still ok if you get some temporary spikes above 1.00 … but when you’re consistently above 1.00, you need to worry.

Running the command:
[snippet id=”19″]
will display a wall of information on the SSH screen you are looking at. This command shows the top processes currently running on your server. At first this may seem like a lot to digest all at once, but if you take the time to learn and understand what you are looking at, it becomes a very important tool for troubleshooting load problems and slow servers.
In the top right you will see:
[snippet id=”20″]
This is the current load, shown as the one, five, and fifteen minute load averages. As you can see, my server is well optimized and running well because the load is near 0.
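If you just want those three numbers without the full top display, the uptime command prints the same load averages. The output below is purely illustrative, not taken from my server:

    uptime
     14:32:07 up 12 days,  3:41,  1 user,  load average: 0.04, 0.08, 0.06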

(Note: for a more human readable version of top, I recommend installing htop.)
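Installing htop is usually a one-line package-manager command; a quick sketch, assuming a CentOS/RHEL or Debian/Ubuntu style server (on CentOS the EPEL repository may need to be enabled first):

    yum install htop        # CentOS/RHEL (may require the EPEL repository)
    apt-get install htop    # Debian/Ubuntu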

So you’re saying the ideal is to stay at 1.00?
Well, not exactly. The problem with a load of 1.00 is that you have no headroom. In practice, many sysadmins will draw the line at 0.70:

  • The “Need to Look into it” Rule of Thumb: 0.70. If your load average is staying above 0.70, it’s time to investigate before things get worse.
  • The “Fix this now” Rule of Thumb: 1.00. If your load average stays above 1.00, find the problem and fix it now. Otherwise, you’re going to get woken up in the middle of the night, and it’s not going to be fun.
  • The “Arrgh, it’s 3AM WTF?” Rule of Thumb: 5.00. If your load average is above 5.00, you could be in serious trouble: your box is either hanging or slowing way down, and this will (inexplicably) happen at the worst possible time, like in the middle of the night or when you’re presenting at a conference. Don’t let it get there.

Credit to Scout for the Bridge Analogy

“My Load is higher than 1, is that bad?”
A load can be higher than 1; whether that is bad depends on how many CPU cores your system has.
A load of 2 is high for a single-core system, but a system with two CPU cores is exactly at capacity and running fine at a load of 2, and on a quad-core system a load of 2 is only about half capacity.
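As a rough illustration, assuming a Linux box where /proc/loadavg and the nproc command are available, you can divide the 1-minute load average by your core count; a result near or above 1.00 means the CPU is at or over capacity:

    # Divide the 1-minute load average (first field of /proc/loadavg) by the number of cores.
    awk -v cores="$(nproc)" '{ printf "load per core: %.2f\n", $1 / cores }' /proc/loadavg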

Now that you have an idea of what load is, you can take this information, plus the output from top/htop, and see which processes are currently taking the most resources.
Examining top as it’s running, you will see something like this:
[snippet id=”22″]
While that may look like a lot, it’s actually not. This displays your current system processes in a raw output format.

Note: If you don’t know how many cores you have, run this command:
[snippet id=”29″]
On my server, it returns that I have 2 available cores.
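Two common ways to count cores on a Linux server are shown below; either should report 2 on a two-core server like the one in this example:

    nproc                              # prints the number of available processing units
    grep -c ^processor /proc/cpuinfo   # counts the processor entries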

Commonly-used columns:

  1. PID – Process ID – The unique ID of the process (commonly used with the kill command; see the example after this list)
  2. USER – Username which the process is running under
  3. PR – Priority of the process as seen by the kernel (lower values mean higher priority; for normal tasks top shows 20 plus the nice value, and rt indicates a real-time process)
  4. NI – Nice value modifies the priority of the process (a negative value will increase the priority of the process and a positive value will decrease the priority of the process)
  5. VIRT – Total amount of virtual memory used by the process
  6. RES – Resident size (kb) – Non-swapped physical memory which the process has used
  7. SHR – Shared memory size (kb) – Amount of shared memory used by the process (memory that could be shared with other processes)
  8. S – Process status – Possible values:
    • R – Running
    • D – Sleeping (may not be interrupted)
    • S – Sleeping (may be interrupted)
    • T – Traced or stopped
    • Z – Zombie (a process that has finished running but has not yet been cleaned up by its parent)
  9. %CPU – Percentage of CPU time the process was using at the time top last updated
  10. %MEM – Percentage of memory (RAM) the process was using at the time top last updated
  11. TIME+ – Total CPU time the process has used since it started, shown to hundredths of a second (with cumulative mode toggled on, this also includes the process’s dead children)
  12. COMMAND – Name of the process or the path to the command used to start the process (press c to toggle between the name of the process and the path to the command used to start the process)
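For example, the PID, PR, and NI columns tie in with commands like these (PID 1234 is just a placeholder; substitute a real PID from your own top output):

    renice -n 10 -p 1234    # make PID 1234 "nicer", i.e. lower its priority
    kill 1234               # send SIGTERM to PID 1234, asking it to exit cleanly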

For a more human readable format, again I suggest using htop. That looks like this:

(screenshot: using htop)

Now that you have the proper tools and a “just enough to be dangerous” level of knowledge, let’s take this information and see what to do with it.
From top or htop, you can see your high load in the top right; at the bottom, in the list of processes, the left-hand column shows the PID (Process ID) of each running process. For example, Apache (httpd) may write its main process number to a PID file (which is a regular text file, nothing more than that), and later use the information stored there to stop its running process. If more than one instance of a program is running, there will be a separate PID for each one; we’ll use PHP as an example, because a website runs on PHP. If 5 visitors open your WordPress home page, 5 PIDs will appear, one for each open page. And since WordPress uses MySQL for its database storage, you will see MySQL PIDs flare up as a result.
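To make that concrete: on many CentOS/cPanel style servers, Apache’s PID file lives somewhere like /var/run/httpd/httpd.pid (the exact path varies by distribution and configuration):

    cat /var/run/httpd/httpd.pid    # prints the PID of Apache's main (parent) process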

If you find your server running slow, first check what is making it slow. Nothing other than the software running on your server will make it slow, so this is where you need to look. If you see multiple PHP processes, a high load, and a slow server, chances are one of your sites is killing the server.

So what do I do about the trouble processes that are running? How do I stop them?
This is a good question that can be answered many ways. The ideal approach is to stop the processes, investigate why your site is keeping them open, optimize how your site runs, and figure out why it puts so much strain on your server. There is no single “best way” to go about this, as every tech will tell you a different method. That’s not a bad thing; the different methods are part of what makes Linux so great.

Hypothetical slow server situation.
Okay, so let’s say you have a WordPress site. You just installed a few plugins and let them run out of the box, with no configuration. After a few days you notice that your page loads are slower than they used to be. Before blaming your host and server for the slowness, first check how well optimized your site is using a site scanning tool. We recommend GTmetrix, as it displays grades for each page-loading component as well as an element load timeline. Once you’ve done that, you can follow our guide on Optimizing WordPress to help make your site faster.
In the event that your sites won’t load and you cannot get in to cPanel, it is most likely that your server is under a high load because of the recent changes you made to your sites. Optimization is a great way to keep your server from slowing down, but sometimes a single change will have a big impact on your server. The next step is to check your load. After logging in as root, you run top/htop and discover that your server is under a pretty heavy load for a two-core server.
[snippet id=”23″]
At this point your server is slow, and even typing commands may feel sluggish, but don’t panic. Just let it do its thing; don’t try to force anything. You check the actively running processes and see that there are a lot of open PHP processes. These can be good and bad: good because you have visitors, which means you have traffic, and bad because if your site is not optimized it will slow down and can even crash your server. If your load is high, these processes are waiting their turn to “cross the bridge” and deliver content. The speed at which they “cross the bridge” depends on your site software and how much it has to process, so if your sites are going 20 miles below the speed limit because of the amount of content they have to deliver, chances are it will cause “slow’n’go” traffic across the bridge. So, as a temporary fix until you fix your software, you can stop or “kill” all existing processes. You can run a command to kill these processes based on the PID number associated with each process, or you can use a more complex command to stop them.

Again, if you don’t feel comfortable using commands or killing processes, please either stop and hire a professional, or read up on these commands before you execute them. If you are ready for this, continue on.

First, you want to see what that process is doing. With top you will see something like this:
[snippet id=”25″]
This is an active page load. This process spawned when the home page of carvertown.com was opened, hence the file path to index.php (to see the exact source that is running, press c to toggle between the name of the process and the path to the command used to start the process). As you can see, this process hasn’t been alive for very long, but it is consuming quite a large amount of CPU time in those few seconds: the %CPU column shows 16.6 for the few seconds it takes to load. This isn’t a problem for this website and server, because it only lasts a few seconds and the load was low, so there was no wait to “cross the bridge” and deliver the content. It becomes an issue when you have several of these waiting to “cross the bridge”. From the top output, you can see this process was spawned from php54; it’s a PHP 5.4 process, because WordPress is written in PHP and spawns PHP processes on page load. The standard top output can show where it’s coming from; as an alternative, you can use the ps command (which reports a snapshot of the current processes). For example:
[snippet id=”26″]
I ran the command ps faux, which lists processes like the top command does, but as a single snapshot instead of constantly updating. We use a pipe, written as |, which means “take the output of ps faux and send it to the grep command,” so that only the data we grep for is shown (grep displays anything that matches “php” in this instance). From the example you can see that the first process, the WordPress admin page, is using CPU time to process things. The second process can be ignored; that’s our “grep php” command itself.
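Put together, the command looks like this:

    ps faux | grep php    # snapshot of all processes, filtered to lines containing "php"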
If you find that you have a ton of these PHP processes open, the CPU time is high, and they don’t look like they are going to finish anytime soon, you can kill them with the kill command. From the example, you can take PID 383 for the open WordPress admin page and use the kill command to stop it completely.
[snippet id=”27″]
This will kill the process identified by PID 383. Pretty simple, huh?
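In plain terms, the command is just kill followed by the PID; for the example above that would be something like this (yours will have a different PID):

    kill 383    # sends SIGTERM, politely asking process 383 to exit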
Now, if you have a ton of these and it’s not just a single process causing all the fuss, you may want to use a more complex command. I use this command to kill off processes matching certain criteria (in this case, I grep for the command name).
[snippet id=”28″]
(Note: kill -9 should only ever be used as a last resort, as it can leave behind orphaned shared memory segments, and is therefore best avoided if possible.)
That will stop all current PHP processes. If you still have visitors on your site, it may be necessary to stop Apache before killing the processes, so no new PHP processes start up; a sketch of both steps follows.
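The exact command varies from admin to admin, but a common pattern looks like the sketch below. The [p]hp trick keeps grep from matching itself, the httpd service name assumes a CentOS/cPanel style server, and kill -9 should replace the plain kill only if processes refuse to die:

    service httpd stop                                          # optional: stop Apache so no new PHP processes spawn
    ps aux | grep '[p]hp' | awk '{print $2}' | xargs -r kill    # send SIGTERM to every process whose command contains "php"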

After identifying and stopping any trouble processes, it is time to check your site files, then update, configure, and optimize. You will see the load drop and your server will become more responsive as a result. Once you have made any necessary changes to your sites via FTP or SSH, you can start Apache back up (if you had stopped it) and monitor your load again with top/htop. Open a few pages in a browser to see if they start hanging again or causing problems. If you see fewer problems but still want to troubleshoot, you can do so.

 
