Server Optimization Guide

There are two aspects to optimizing Phusion Passenger's performance.

The first aspect is settings tuning. Phusion Passenger's default settings are not aimed at performance, but at safety. The defaults are designed to conserve resources, to prevent server overload, and to keep web apps up and running. To optimize for performance, you need to tweak some settings whose optimal values depend on your hardware and your environment.

Besides Phusion Passenger settings, you may also want to tune kernel-level settings.

The second aspect is using performance-enhancing features. This requires small application-level changes.

If you are optimizing Phusion Passenger for the purpose of benchmarking then you should also follow the benchmarking recommendations.

Minimizing process spawning

By default, Phusion Passenger spawns and shuts down application processes according to traffic. This allows it to use more resources during busy times, while conserving resources during idle times. This is especially useful if you host more than one app on a single server: if not all apps are busy at the same time, you don't have to keep all of them running simultaneously.

However, spawning a process takes a lot of time (in the order of 10-20 seconds for a Rails app), and CPU usage will be near 100% during spawning. Therefore, while spawning, your server will be slower at performing other activities, such as handling requests.

For consistent performance, it is thus recommended that you configure a static process pool: telling Phusion Passenger to use a fixed number of processes, instead of spawning and shutting them down dynamically.

Standalone

Run passenger start with --max-pool-size=N --min-instances=N, where N is the number of processes you want.
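
For example, with a hypothetical value of N = 10:

passenger start --max-pool-size=10 --min-instances=10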

Nginx

Let N be the number of processes you want. Set the following configuration in your http block:

passenger_max_pool_size N;

Set the following configuration in your server block:

passenger_min_instances N;

You should also configure passenger_pre_start in the http block so that your app is started during web server launch:

# Refer to the Users Guide for more information about passenger_pre_start.
passenger_pre_start http://your-website-url.com;
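
Putting these together, a minimal sketch of the relevant Nginx configuration might look like this. The value N = 10, the URL and the application path are placeholders; substitute your own:

http {
    passenger_max_pool_size 10;
    passenger_pre_start http://your-website-url.com;

    server {
        listen 80;
        server_name your-website-url.com;
        root /path/to/app/public;
        passenger_enabled on;
        passenger_min_instances 10;
    }
}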

Apache

Let N be the number of processes you want. Set the following configuration in the global context:

PassengerMaxPoolSize N

Set the following configuration in your virtual host block:

PassengerMinInstances N

You should also configure PassengerPreStart in the global context so that your app is started during web server launch:

# Refer to the Users Guide for more information about PassengerPreStart.
PassengerPreStart http://your-website-url.com
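
A corresponding minimal Apache sketch, again with placeholder values, might look like this:

# Global context (e.g. httpd.conf)
PassengerMaxPoolSize 10
PassengerPreStart http://your-website-url.com

<VirtualHost *:80>
    ServerName your-website-url.com
    DocumentRoot /path/to/app/public
    PassengerMinInstances 10
</VirtualHost>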

Maximizing throughput

This section provides guidance on maximizing Phusion Passenger's throughput. The amount of throughput that Phusion Passenger handles is proportional to the number of processes or threads that you've configured. More processes/threads generally means more throughput, but there is an upper limit. Past a certain value, further increasing the number of processes/threads won't help. If you increase the number of processes/threads even further, then performance may even go down.

The optimal value depends on the hardware and the environment. This section will provide you with formulas to calculate that optimal value. The following factors are involved in the calculation: the total amount of RAM in your machine, the number of CPU cores, the amount of memory your application uses per process, and whether your workload is CPU-bound or performs a lot of blocking I/O.

The formulas in this section assume that your machine is dedicated to Phusion Passenger. If your machine also hosts other software (e.g. a database) then you'll need to tweak the formulas a little bit.

Tuning the application process and thread count

Step 1: determining the application's memory usage

The amount of memory that your application uses on a per-process basis is key to our calculation. You should first figure out how much memory your application typically needs. Every application has different memory usage patterns, so the typical memory usage is best determined by observation.

Run your app for a while, then run passenger-status at different points in time to examine memory usage. Then calculate the average of your data points. In the rest of this section, we'll refer to the amount of memory (in MB) that an application process needs, as RAM_PER_PROCESS.
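
For example, a quick way to collect data points is to log the passenger-status output periodically and read the per-process memory figures afterwards:

# Record passenger-status once per minute for an hour; afterwards, inspect the
# memory column of the process list in passenger_memory.log.
for i in $(seq 1 60); do
    date >> passenger_memory.log
    passenger-status >> passenger_memory.log
    sleep 60
done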

In our experience, a typical medium-sized single-threaded Rails application process can use 150 MB of RAM on a 64-bit machine, even when the spawning method is set to "smart".

Step 2: determine the system's limits

First, let's define the maximum number of (single-threaded) processes, or the number of threads, that you can comfortably have given the amount of RAM you have. This is a reasonable upper limit that you can reach without degrading system performance. This number is not the final optimal number; it is merely used in further calculations in later steps.

There are two formulas that we can use, depending on what kind of concurrency model your application is using in production.

Purely single-threaded multi-process formula

If you didn't explicitly configure multithreading, then you are using this concurrency model. Or, if you are not using Ruby (e.g. if you are using Python, Node.js or Meteor), then you are also using this concurrency model, because Phusion Passenger only supports multithreading for Ruby apps.

The formula is then as follows:

max_app_processes = (TOTAL_RAM * 0.75) / RAM_PER_PROCESS

It is derived as follows: we assume that about 25% of the system's RAM should remain available for the operating system and other software, leaving roughly 75% of TOTAL_RAM for application processes. Dividing that budget by the memory used per process gives the maximum number of processes.

Multithreaded formula

The formula for multithreaded concurrency is as follows:

max_app_threads_per_process =
  ((TOTAL_RAM * 0.75) - (CHOSEN_NUMBER_OF_PROCESSES * RAM_PER_PROCESS * 0.9)) /
  (RAM_PER_PROCESS / 10)

Here, CHOSEN_NUMBER_OF_PROCESSES is the number of application processes you want to use. In the case of Ruby, Python, Node.js and Meteor, this should be equal to NUMBER_OF_CPUS, because these languages can only utilize a single CPU core per process. If you're using a language runtime that does not have a Global Interpreter Lock, e.g. JRuby or Rubinius, then CHOSEN_NUMBER_OF_PROCESSES can be 1.
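
If you're unsure how many CPU cores the machine has, you can check on Linux with either of the following commands:

# Number of available CPU cores on Linux.
nproc
grep -c ^processor /proc/cpuinfo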

The formula is derived along the same lines as the previous one: roughly 75% of TOTAL_RAM is available to the application. From that we subtract the memory used by the processes themselves, counted at about 90% of RAM_PER_PROCESS each, since the remainder is attributed to thread overhead. Each thread is assumed to cost roughly 10% of RAM_PER_PROCESS, which is what the division by RAM_PER_PROCESS / 10 accounts for.

On 32-bit systems, max_app_threads_per_process should not be higher than about 200. Assuming an 8 MB stack size per thread, you will run out of virtual address space if you go much further. On 64-bit systems you don’t have to worry about this problem.

Step 3: derive the applications' needs

The earlier two formulas were not for calculating the number of processes or threads that your application needs, but for calculating how much the system can handle without getting into trouble. Your application may not actually need that many processes or threads! If your application is CPU-bound, then you only need a small multiple of the number of CPUs you have. Only if your application performs a lot of blocking I/O (e.g. database calls that take tens of milliseconds to complete, or HTTP calls to external services such as Twitter) do you need a large number of processes or threads.

Armed with this knowledge, we derive the formulas for calculating how many processes or threads we actually need.
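
As a rough guideline, consistent with the worked examples later in this section, the resulting formulas look like this:

# If your application is mostly CPU-bound:
desired_app_processes = NUMBER_OF_CPUS (or a small multiple thereof)
desired_app_threads_per_process = NUMBER_OF_CPUS (or a small multiple thereof)

# If your application performs a lot of blocking I/O:
desired_app_processes = max_app_processes
desired_app_threads_per_process = max_app_threads_per_process

In all cases, the desired values should not exceed the maximums calculated in step 2.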

Step 4: configure Phusion Passenger

Purely single-threaded multi-process scenarios

Set the maximum pool size and the minimum number of instances to desired_app_processes, using the same directives shown in the "Minimizing process spawning" section. Examples 1 and 2 below show what this looks like in practice.

Multithreaded scenarios

In order to use multithreading you must use Phusion Passenger Enterprise; the open source version of Phusion Passenger does not support multithreading. Set the pool size and minimum instances to your chosen number of processes, set the concurrency model to thread, and set the thread count to the number of threads per process you calculated. Example 3 below shows what this looks like in practice.

Possible step 5: configure Rails

You only need to configure Rails if you're using the multithreaded concurrency model. Enable thread-safety by calling config.threadsafe! in config/environments/production.rb. In Rails 4.0 this is on by default for the production environment, but in earlier versions you have to enable it manually.

You should also increase the ActiveRecord pool size, because it limits the number of concurrent database connections. You can configure it in config/database.yml: set the pool value to the number of threads per application process. If you believe your database cannot handle that much concurrency, keep it at a lower value.
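
For example, if you run 16 threads per application process, the relevant part of config/database.yml might look like this (the adapter and database name are placeholders):

# config/database.yml
production:
  adapter: postgresql
  database: myapp_production
  pool: 16   # match the number of threads per application process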

Example 1: purely single-threaded multi-process scenario with lots of blocking I/O, in a low-memory server

Suppose you have a server with 1 GB of RAM, and an application that uses about 150 MB of RAM per process.

Then the calculation is as follows:

# Use this formula for purely single-threaded multi-process deployments.
max_app_processes = (1024 * 0.75) / 150 = 5.12
desired_app_processes = max_app_processes = 5.12

Conclusion: you should use 5 or 6 processes. Phusion Passenger should be configured as follows:

# Standalone
passenger start --max-pool-size=5 --min-instances=5

# Nginx
passenger_max_pool_size 5;
passenger_min_instances 5;

# Apache
PassengerMaxPoolSize 5
PassengerMinInstances 5

However, a concurrency of 5 or 6 is way too low if your application performs a lot of blocking I/O. You should use a multithreaded deployment instead, or get more RAM so that you can run more processes.

Example 2: purely single-threaded multi-process scenario with lots of blocking I/O, in a high-memory server

Suppose you have a server with 32 GB of RAM and 8 CPU cores, and an application that uses about 150 MB of RAM per process.

Then the calculation is as follows:

# Use this formula for purely single-threaded multi-process deployments.
max_app_processes = (1024 * 32 * 0.75) / 150 = 163.84
desired_app_processes = max_app_processes = 163.84

Conclusion: you should use 163 or 164 processes. This number seems high, but the value is correct. Because your app performs a lot of blocking I/O, you need a lot of I/O concurrency. The more concurrency the better. The amount of concurrency scales linearly with the number of processes, which is why you end up with such a large number.

Phusion Passenger should be configured as follows:

# Standalone
passenger start --max-pool-size=163 --min-instances=163

# Nginx
passenger_max_pool_size 163;
passenger_min_instances 163;

# Apache
PassengerMaxPoolSize 163
PassengerMinInstances 163

Note that in this example, 163-164 processes is merely the maximum number of processes that you can run, without overloading your RAM. It does not mean that you have enough concurrency for your application! If you need more concurrency, you should use a multithreaded deployment instead.

Example 3: multithreaded deployment with lots of blocking I/O

Consider the same machine as in example 2: 32 GB of RAM, 8 CPU cores, and an application that uses about 150 MB of RAM per process.

But this time you're using multithreading with 8 application processes (because you have 8 CPUs). How many threads do you need per process?

# Use this formula for multithreaded deployments.
max_app_threads_per_process
= ((1024 * 32 * 0.75) - (8 * 150 * 0.9)) / (150 / 10)
= 1566.4

Conclusion: you should use 1566 threads per process.

# Standalone
passenger start --max-pool-size=8 --min-instances=8 --concurrency-model=thread --thread-count=1566

# Nginx
passenger_max_pool_size 8;
passenger_min_instances 8;
passenger_concurrency_model thread;
passenger_thread_count 1566;

# Apache
PassengerMaxPoolSize 8
PassengerMinInstances 8
PassengerConcurrencyModel thread
PassengerThreadCount 1566

Because of the huge number of threads, this only works on a 64-bit platform. If you're on a 32-bit platform, consider lowering the number of threads while raising the number of processes. For example, you can double the number of processes (to 16) and halve the number of threads (to 783).
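
For instance, the doubled-processes variant just described would look like this in Nginx:

passenger_max_pool_size 16;
passenger_min_instances 16;
passenger_concurrency_model thread;
passenger_thread_count 783;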

Configuring the web server

If you're using Nginx, then no additional configuration is needed: Nginx is evented and already supports high concurrency out of the box.

If you're using Apache, then prefer the worker MPM (which uses a combination of processes and threads) or the event MPM (which is similar to the worker MPM, but better) over the prefork MPM (which only uses processes) whenever possible. mod_php requires prefork, but if you don't use mod_php then you can probably use one of the other MPMs. Make sure you set a low number of processes and a moderate to high number of threads.

Because Apache performs a lot of blocking I/O (namely HTTP handling), you should give it a lot of threads so that it has a lot of concurrency. Apache's concurrency must be somewhat larger than the total number of application processes or application threads. In example 3, for instance, Apache's concurrency must be larger than 8 * 1566 = 12528.
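
As an illustration only: the directives below are standard Apache event MPM settings, but the values are hypothetical and must be scaled to your situation so that MaxRequestWorkers exceeds your total application concurrency.

# Hypothetical event MPM settings: few processes, many threads.
<IfModule mpm_event_module>
    StartServers             4
    ServerLimit             16
    ThreadsPerChild         64
    ThreadLimit             64
    MaxRequestWorkers     1024
    MinSpareThreads         64
    MaxSpareThreads        256
</IfModule>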

If you cannot use the event MPM, consider putting Apache behind an Nginx reverse proxy, with response buffering turned on on the Nginx side. This relieves Apache of a lot of concurrency problems. If you can use the event MPM, then adding Nginx to the mix does not provide many advantages.
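
A minimal sketch of such a buffering reverse proxy, with Apache listening on a local port behind Nginx (hostnames and ports are placeholders):

server {
    listen 80;
    server_name your-website-url.com;

    location / {
        proxy_pass http://127.0.0.1:8080;   # Apache behind Nginx
        proxy_buffering on;                 # buffer responses so Apache workers are released quickly
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}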

Summary

Performance-enhancing features

Turbocaching

Phusion Passenger supports turbocaching since version 5. Turbocaching is an HTTP cache built inside Phusion Passenger. When used correctly, the cache can accelerate your app tremendously.

To utilize turbocaching, you only need to set HTTP caching headers. Please refer to Google's HTTP caching tutorial. Phusion Passenger takes advantage of the HTTP headers automatically.
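
For example, in a Rails controller you could mark a response as publicly cacheable for a short period. This is only a sketch; any framework's mechanism for setting Cache-Control headers works equally well.

class ArticlesController < ApplicationController
  def index
    @articles = Article.all
    # Tell HTTP caches (including Phusion Passenger's turbocache) that this
    # response may be stored and reused for up to 30 seconds.
    expires_in 30.seconds, public: true
  end
end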

Out-of-band garbage collection

Phusion Passenger supports out-of-band garbage collection for Ruby apps. With this feature enabled, Phusion Passenger can run the garbage collector in between requests, so that the garbage collector doesn't delay the app as much. Please refer to the Users Guide for more information about this feature.

Benchmarking recommendations

Tooling recommendations

Operating system recommendations

Server and application recommendations