There are two aspects to optimizing Phusion Passenger performance.
The first aspect is settings tuning. Phusion Passenger's default settings are aimed at safety rather than at performance. The defaults are designed to conserve resources, to prevent server overload and to keep web apps up and running. To optimize for performance, you need to tweak some settings whose values depend on your hardware and your environment.
Besides Phusion Passenger settings, you may also want to tune kernel-level settings.
The second aspect is using performance-enhancing features. This requires small application-level changes.
If you are optimizing Phusion Passenger for the purpose of benchmarking, then you should also follow the benchmarking recommendations.
By default, Phusion Passenger spawns and shuts down application processes according to traffic. This allows it to use more resources during busy times, while conserving resources during idle times. This is especially useful if you host more than 1 app on a single server: if not all apps are used at the same time, then you don't have to keep all apps running at the same time.
However, spawning a process takes a lot of time (on the order of 10-20 seconds for a Rails app), and CPU usage will be near 100% during spawning. Therefore, while spawning, your server will be slower at performing other activities, such as handling requests.
For consistent performance, it is thus recommended that you configure a static process pool: telling Phusion Passenger to use a fixed number of processes, instead of spawning and shutting them down dynamically.
If you use Passenger Standalone, run passenger start with --max-pool-size=N --min-instances=N, where N is the number of processes you want.
If you use Nginx, let N be the number of processes you want. Set the following configuration in your http block:
passenger_max_pool_size N;
Set the following configuration in your server block:
passenger_min_instances N;
You should also configure passenger_pre_start in the http block so that your app is started during web server launch:
# Refer to the Users Guide for more information about passenger_pre_start.
passenger_pre_start http://your-website-url.com;
If you use Apache, let N be the number of processes you want. Set the following configuration in the global context:
PassengerMaxPoolSize N
Set the following configuration in your virtual host block:
PassengerMinInstances N
You should also configure PassengerPreStart in the global context so that your app is started during web server launch:
# Refer to the Users Guide for more information about PassengerPreStart.
PassengerPreStart http://your-website-url.com
This section provides guidance on maximizing Phusion Passenger's throughput. The amount of throughput that Phusion Passenger handles is proportional to the number of processes or threads that you've configured. More processes/threads generally means more throughput, but there is an upper limit. Past a certain value, further increasing the number of processes/threads won't help. If you increase the number of processes/threads even further, then performance may even go down.
The optimal value depends on the hardware and the environment. This section will provide you with formulas to calculate that optimal value. The following factors are involved in the calculation:
Number of CPUs. True (hardware) concurrency cannot be higher than the number of CPUs. In theory, if all processes/threads on your system use the CPUs constantly, then throughput stops increasing beyond NUMBER_OF_CPUS processes/threads. Having more processes than CPUs may decrease total throughput a little because of context switching overhead, but the difference is not big because OSes are good at context switching these days.
On the other hand, if your CPUs are not used constantly, e.g. because they’re often blocked on I/O, then the above does not apply and increasing the number of processes/threads does increase concurrency and throughput, at least until the CPUs are saturated.
Blocking I/O. This covers all blocking I/O, including hard disk access latencies, database call latencies, web API calls, etc. Handling input from the client and output to the client does not count as blocking I/O, because Phusion Passenger has buffering layers that relieve the application from worrying about this.
The more blocking I/O calls your application process/thread makes, the more time it spends waiting for external components. While it's waiting it does not use the CPU, so that's when another process/thread should get the chance to use the CPU. If no other process/thread needs CPU right now (e.g. all processes/threads are waiting for I/O) then CPU time is essentially wasted. Increasing the number of processes or threads decreases the chance of CPU time being wasted. It also increases concurrency, so that clients do not have to wait for a previous I/O call to be completed before being served.
The formulas in this section assume that your machine is dedicated to Phusion Passenger. If your machine also hosts other software (e.g. a database) then you'll need to tweak the formulas a little bit.
The amount of memory that your application uses on a per-process basis is key to our calculation. You should first figure out how much memory your application typically needs. Every application has different memory usage patterns, so the typical memory usage is best determined by observation.
Run your app for a while, then run passenger-status at different points in time to examine memory usage. Then calculate the average of your data points. In the rest of this section, we'll refer to the amount of memory (in MB) that an application process needs as RAM_PER_PROCESS.
In our experience, a typical medium-sized single-threaded Rails application process can use 150 MB of RAM on a 64-bit machine, even when the spawning method is set to "smart".
First, let's define the maximum number of (single-threaded) processes, or the total number of threads, that you can comfortably have given the amount of RAM you have. This is a reasonable upper limit that you can reach without degrading system performance. This number is not the final optimal number, but is merely used for further calculations in later steps.
There are two formulas that we can use, depending on what kind of concurrency model your application is using in production.
Purely single-threaded multi-process formula
If you didn't explicitly configure multithreading, then you are using this concurrency model. Or, if you are not using Ruby (e.g. if you are using Python, Node.js or Meteor), then you are also using this concurrency model, because Phusion Passenger only supports multithreading for Ruby apps.
The formula is then as follows:
max_app_processes = (TOTAL_RAM * 0.75) / RAM_PER_PROCESS
It is derived as follows:
(TOTAL_RAM * 0.75): We can assume that there must be at least 25% of free RAM that the operating system can use for other things. The result of this calculation is the RAM that is freely available for applications. If your system runs a lot of services and thus has less memory available for Phusion Passenger and its apps, then you should lower 0.75 to some constant that you think is appropriate.
/ RAM_PER_PROCESS: Each process consumes a roughly constant amount of RAM, so the maximum number of processes is a single division of the result of the aforementioned calculation by this constant.
Multithreaded formula
The formula for multithreaded concurrency is as follows:
max_app_threads_per_process =
((TOTAL_RAM * 0.75) - (CHOSEN_NUMBER_OF_PROCESSES * RAM_PER_PROCESS * 0.9)) /
(RAM_PER_PROCESS / 10) /
CHOSEN_NUMBER_OF_PROCESSES
Here, CHOSEN_NUMBER_OF_PROCESSES is the number of application processes you want to use. In case of Ruby, Python, Node.js and Meteor, this should be equal to NUMBER_OF_CPUS. This is because all these languages can only utilize a single CPU core per process. If you're using a language runtime that does not have a Global Interpreter Lock, e.g. JRuby or Rubinius, then CHOSEN_NUMBER_OF_PROCESSES can be 1.
The formula is derived as follows:
(TOTAL_RAM * 0.75): The same as explained earlier.
(CHOSEN_NUMBER_OF_PROCESSES * RAM_PER_PROCESS * 0.9): This calculates the amount of memory that all the processes together would consume, assuming they're not running any threads. When this is deducted from TOTAL_RAM * 0.75, we end up with the amount of RAM available to application threads.
/ (RAM_PER_PROCESS / 10): We estimate that a thread consumes ~10% of the amount of memory a process would, so we divide the amount of RAM available to threads by this number. What we get is the number of threads that the system can handle.
/ CHOSEN_NUMBER_OF_PROCESSES: Dividing the total number of threads by the number of processes gives the number of threads per process.
On 32-bit systems, max_app_threads_per_process should not be higher than about 200. Assuming an 8 MB stack size per thread, you will run out of virtual address space if you go much further. On 64-bit systems you don't have to worry about this problem.
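The two formulas above can be sketched as small Python helpers. This is only an illustration: the function and parameter names are made up for this example and are not part of Phusion Passenger, and the constants (0.75, 0.9, and the 10% per-thread estimate) are simply the ones used in the formulas above.

```python
def max_app_processes(total_ram_mb, ram_per_process_mb, usable_fraction=0.75):
    """Upper limit on single-threaded processes that comfortably fit in RAM."""
    return (total_ram_mb * usable_fraction) / ram_per_process_mb


def max_app_threads_per_process(total_ram_mb, ram_per_process_mb,
                                chosen_number_of_processes,
                                usable_fraction=0.75):
    """Upper limit on threads per process for the multithreaded model."""
    usable_ram = total_ram_mb * usable_fraction
    # Memory consumed by the processes themselves, before any threads run.
    process_ram = chosen_number_of_processes * ram_per_process_mb * 0.9
    # A thread is estimated to cost ~10% of the memory of a process.
    ram_per_thread = ram_per_process_mb / 10
    total_threads = (usable_ram - process_ram) / ram_per_thread
    # Spread the total thread budget evenly over the chosen processes.
    return total_threads / chosen_number_of_processes
```

For instance, max_app_processes(1024, 150) evaluates to 5.12 for a machine with 1 GB of RAM and an app that needs 150 MB per process.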
The earlier two formulas were not for calculating the number of processes or threads that your application needs, but for calculating how much the system can handle without getting into trouble. Your application may not actually need that many processes or threads! If your application is CPU-bound, then you only need a small multiple of the number of CPUs you have. Only if your application performs a lot of blocking I/O (e.g. database calls that take tens of milliseconds to complete, or calls to the Twitter API) do you need a large number of processes or threads.
Armed with this knowledge, we derive the formulas for calculating how many processes or threads we actually need.
If your application performs a lot of blocking I/O then you should give it as many processes and threads as possible:
# Use this formula for purely single-threaded multi-process scenarios.
desired_app_processes = max_app_processes
# Use this formula for multithreaded scenarios.
desired_app_threads_per_process = max_app_threads_per_process
If your application doesn’t perform a lot of blocking I/O, then you should limit the number of processes or threads to a multiple of the number of CPUs to minimize context switching:
# Use this formula for purely single-threaded multi-process scenarios.
desired_app_processes = min(max_app_processes, NUMBER_OF_CPUS)
# Use this formula for multithreaded scenarios.
desired_app_threads_per_process = min(max_app_threads_per_process, 2 * NUMBER_OF_CPUS)
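The two cases above can be expressed as a pair of Python functions. This is a sketch under assumptions: the function names and the heavy_blocking_io flag are hypothetical, and deciding whether an app "performs a lot of blocking I/O" remains a judgment you make by observing your application.

```python
def desired_app_processes(max_app_processes, number_of_cpus, heavy_blocking_io):
    """Processes to configure for the single-threaded multi-process model."""
    if heavy_blocking_io:
        # I/O-bound: use everything the RAM budget allows.
        return max_app_processes
    # CPU-bound: more processes than CPUs only adds context switching.
    return min(max_app_processes, number_of_cpus)


def desired_app_threads_per_process(max_app_threads_per_process,
                                    number_of_cpus, heavy_blocking_io):
    """Threads per process to configure for the multithreaded model."""
    if heavy_blocking_io:
        return max_app_threads_per_process
    return min(max_app_threads_per_process, 2 * number_of_cpus)
```

The results may be fractional; round them to a whole number before putting them in the configuration.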
Purely single-threaded multi-process scenarios
Standalone: run passenger start with --max-pool-size=<desired_app_processes> --min-instances=<desired_app_processes>.
Nginx:
passenger_max_pool_size <desired_app_processes>;
passenger_min_instances <desired_app_processes>;
Also configure passenger_pre_start to have your app started automatically at web server boot.
Apache:
PassengerMaxPoolSize <desired_app_processes>
PassengerMinInstances <desired_app_processes>
Also configure PassengerPreStart to have your app started automatically at web server boot.
Multithreaded scenarios
In order to use multithreading you must use Phusion Passenger Enterprise. The open source version of Phusion Passenger does not support multithreading.
Standalone: run passenger start with --max-pool-size=<CHOSEN_NUMBER_OF_PROCESSES> --min-instances=<CHOSEN_NUMBER_OF_PROCESSES> --concurrency-model=thread --thread-count=<desired_app_threads_per_process>. If CHOSEN_NUMBER_OF_PROCESSES is 1, then you should also add --spawn-method=direct. By using direct spawning instead of smart spawning, Phusion Passenger will not keep a Preloader process around, saving you some memory. This is because a Preloader process is useless when there's only 1 application process.
Nginx:
passenger_max_pool_size <CHOSEN_NUMBER_OF_PROCESSES>;
passenger_min_instances <CHOSEN_NUMBER_OF_PROCESSES>;
passenger_concurrency_model thread;
passenger_thread_count <desired_app_threads_per_process>;
Also configure passenger_pre_start to have your app started automatically at web server boot. If CHOSEN_NUMBER_OF_PROCESSES is 1, then you should also set passenger_spawn_method direct, for the same reason as with Standalone.
Apache:
PassengerMaxPoolSize <CHOSEN_NUMBER_OF_PROCESSES>
PassengerMinInstances <CHOSEN_NUMBER_OF_PROCESSES>
PassengerConcurrencyModel thread
PassengerThreadCount <desired_app_threads_per_process>
Also configure PassengerPreStart to have your app started automatically at web server boot. If CHOSEN_NUMBER_OF_PROCESSES is 1, then you should also set PassengerSpawnMethod direct, for the same reason as with Standalone.
Only if you're using the multithreaded concurrency model do you need to configure Rails. You need to enable thread-safety by setting config.threadsafe! in config/environments/production.rb. In Rails 4.0 this is on by default for the production environment, but in earlier versions you had to enable it manually.
You should also increase the ActiveRecord pool size because it limits concurrency. You can configure it in config/database.yml. Set the pool value to the number of threads per application process. But if you believe your database cannot handle that much concurrency, keep it at a low value.
Example 1: suppose you have 1 GB (1024 MB) of RAM and an application that needs 150 MB per process.
Then the calculation is as follows:
# Use this formula for purely single-threaded multi-process deployments.
max_app_processes = (1024 * 0.75) / 150 = 5.12
desired_app_processes = max_app_processes = 5.12
Conclusion: you should use 5 or 6 processes. Phusion Passenger should be configured as follows:
# Standalone
passenger start --max-pool-size=5 --min-instances=5
# Nginx
passenger_max_pool_size 5;
passenger_min_instances 5;
# Apache
PassengerMaxPoolSize 5
PassengerMinInstances 5
However, a concurrency of 5 or 6 is way too low if your application performs a lot of blocking I/O. In that case you should use a multithreaded deployment instead, or get more RAM so you can run more processes.
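Example 2: suppose you have 32 GB of RAM, 8 CPUs, and an application that needs 150 MB per process and performs a lot of blocking I/O.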
Then the calculation is as follows:
# Use this formula for purely single-threaded multi-process deployments.
max_app_processes = (1024 * 32 * 0.75) / 150 = 163.84
desired_app_processes = max_app_processes = 163.84
Conclusion: you should use 163 or 164 processes. This number seems high, but the value is correct. Because your app performs a lot of blocking I/O, you need a lot of I/O concurrency. The more concurrency the better. The amount of concurrency scales linearly with the number of processes, which is why you end up with such a large number.
Phusion Passenger should be configured as follows:
# Standalone
passenger start --max-pool-size=163 --min-instances=163
# Nginx
passenger_max_pool_size 163;
passenger_min_instances 163;
# Apache
PassengerMaxPoolSize 163
PassengerMinInstances 163
Note that in this example, 163-164 processes is merely the maximum number of processes that you can run, without overloading your RAM. It does not mean that you have enough concurrency for your application! If you need more concurrency, you should use a multithreaded deployment instead.
Example 3: consider the same machine as in example 2 (32 GB of RAM, 8 CPUs, 150 MB per process).
But this time you're using multithreading with 8 application processes (because you have 8 CPUs). How many threads do you need per process?
# Use this formula for multithreaded deployments.
max_app_threads_per_process
= ((1024 * 32 * 0.75) - (8 * 150 * 0.9)) / (150 / 10) / 8
= 195.8
Conclusion: you should use 195 threads per process (rounding down to stay within the memory budget).
# Standalone
passenger start --max-pool-size=8 --min-instances=8 --concurrency-model=thread --thread-count=195
# Nginx
passenger_max_pool_size 8;
passenger_min_instances 8;
passenger_concurrency_model thread;
passenger_thread_count 195;
# Apache
PassengerMaxPoolSize 8
PassengerMinInstances 8
PassengerConcurrencyModel thread
PassengerThreadCount 195
Because of the huge number of threads, this only works on a 64-bit platform. If you're on a 32-bit platform, consider lowering the number of threads while raising the number of processes. For example, you can double the number of processes (to 16) and halve the number of threads per process (to about 98).
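The trade-off between processes and threads can be sketched as follows. This is a simplified illustration that only keeps the total number of threads constant; it does not account for the extra RAM that the additional processes consume. The function name is hypothetical.

```python
import math


def rebalance_threads(processes, threads_per_process, new_processes):
    """Threads per process needed to keep the same total thread count."""
    total_threads = processes * threads_per_process
    # Round up so that the total concurrency does not drop.
    return math.ceil(total_threads / new_processes)


# 8 processes with 195 threads each is 1560 threads in total.
# Spreading them over 16 processes needs 98 threads per process.
print(rebalance_threads(8, 195, 16))
```

Because more processes use more memory, it is worth re-running the multithreaded formula with the new process count before settling on a final value.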
If you're using Nginx then it does not need additional configuration. Nginx is evented and already supports a high concurrency out of the box.
If you're using Apache, then prefer the worker MPM (which uses a combination of processes and threads) or the event MPM (which is similar to the worker MPM, but better) over the prefork MPM (which only uses processes) whenever possible. PHP requires prefork, but if you don't use PHP then you can probably use one of the other MPMs. Make sure you set a low number of processes and a moderate to high number of threads.
Because Apache performs a lot of blocking I/O (namely HTTP handling), you should give it a lot of threads so that it has a lot of concurrency. Apache's concurrency must be somewhat larger than the total number of application processes or the total number of application threads. When considering example 3, the Apache concurrency must be larger than 8 * 195 = 1560.
If you cannot use the event MPM, consider putting Apache behind an Nginx reverse proxy, with response buffering turned on on the Nginx side. This relieves Apache of a lot of concurrency problems. If you can use the event MPM then adding Nginx to the mix does not provide many advantages.
Phusion Passenger supports turbocaching since version 5. Turbocaching is an HTTP cache built inside Phusion Passenger. When used correctly, the cache can accelerate your app tremendously. To utilize turbocaching, you only need to set HTTP caching headers.
The first thing you should do is to learn how to use HTTP caching headers. It's pretty simple and straightforward. Since the turbocache is just a normal HTTP shared cache, it respects all the HTTP caching rules.
To activate the turbocache, the response must contain either an "Expires" header or a "Cache-Control" header.
The "Expires" header tells the turbocache how long to cache a response. Its value is an HTTP timestamp, e.g. "Thu, 01 Dec 1994 16:00:00 GMT".
The Cache-Control header is a more advanced header that not only allows you to set the caching time, but also how the cache should behave. The easiest way to use it is to set the max-age flag, which has the same effect as setting "Expires". For example, this tells the turbocache that the response is cacheable for at most 60 seconds:
Cache-Control: max-age=60
As you can see, a "Cache-Control" header is much easier to generate than an "Expires" header. Furthermore, "Expires" doesn't work if the visitor's computer's clock is wrongly configured, while "Cache-Control" does. This is why we recommend using "Cache-Control".
Another flag to be aware of is the private flag. This flag tells any shared caches -- caches which are meant to store responses for many users -- not to cache the response. The turbocache is a shared cache. However, the browser's cache is not, so the browser can still cache the response. You should set the "private" flag on responses which are meant for a single user, as you will learn later in this article.
And finally, there is the no-store flag, which tells all caches -- even the browser's -- not to cache the response.
Here is an example of a response which is cacheable for 60 seconds by the browser's cache, but not by the turbocache:
Cache-Control: max-age=60,private
The HTTP specification specifies a bunch of other flags, but they're not relevant for the turbocache.
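The effect of the private and no-store flags can be illustrated with a short Python sketch. It assumes a simplified parser that splits the header on commas and does not handle quoted directive values. The function names are illustrative and are not part of Phusion Passenger.

```python
def parse_cache_control(value):
    """Split a Cache-Control value into a {directive: argument} dict."""
    directives = {}
    for part in value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, argument = part.partition("=")
        directives[name.strip()] = argument.strip() or None
    return directives


def shared_cache_may_store(cache_control):
    """A shared cache (such as the turbocache) must honor private and no-store."""
    directives = parse_cache_control(cache_control)
    return "private" not in directives and "no-store" not in directives


def browser_cache_may_store(cache_control):
    """A browser cache is a private cache, so only no-store forbids storing."""
    return "no-store" not in parse_cache_control(cache_control)
```

With the header value max-age=60,private, shared_cache_may_store returns False while browser_cache_may_store returns True, matching the example above.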
The turbocache currently only caches GET requests. POST, PUT, DELETE and other requests are never cached. If you want your response to be cacheable by the turbocache, be sure to use GET requests, but also be sure that your request is idempotent.
The "Vary" header is used to tell caches that the response depends on one or more request headers. But the turbocache does not implement support for the "Vary" header, so if you output a "Vary" header then the turbocache will not cache your response at all. Avoid using the "Vary" header where possible.
The turbocache caches only responses that are at most 32 KB, including HTTP headers. This maximum size is currently not configurable, but we are working on it.
The turbocache caches a response for a maximum duration of 2 seconds, or whatever is specified as the expiry time according to the HTTP headers, whichever is earliest. The cap of 2 seconds is currently not configurable, but we are working on it.
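The restrictions described above can be summarized as a single predicate. The following Python sketch models the documented rules only; it is not Phusion Passenger's actual implementation. The function name is hypothetical, expiry_seconds stands for an expiry time that has already been derived from the "Expires" or "Cache-Control" header, and 32 KB is assumed to mean 32 * 1024 bytes.

```python
MAX_RESPONSE_BYTES = 32 * 1024  # assumption: 32 KB means 32 * 1024 bytes
MAX_TTL_SECONDS = 2             # documented cap on the caching duration


def turbocache_ttl(method, response_headers, response_size_bytes, expiry_seconds):
    """Return how many seconds a response may be cached, or None if it may not."""
    headers = {name.lower(): value for name, value in response_headers.items()}
    if method.upper() != "GET":
        return None  # only GET requests are cached
    if "vary" in headers:
        return None  # the Vary header is not supported
    if response_size_bytes > MAX_RESPONSE_BYTES:
        return None  # size limit includes the HTTP headers
    if "expires" not in headers and "cache-control" not in headers:
        return None  # one of the two caching headers is required
    cache_control = headers.get("cache-control", "").lower()
    if "private" in cache_control or "no-store" in cache_control:
        return None  # shared caches must not store these responses
    # Whichever comes first: the 2 second cap or the header's expiry time.
    return min(MAX_TTL_SECONDS, expiry_seconds)
```

For example, a 1000-byte GET response with Cache-Control: max-age=60 yields a caching time of 2 seconds, because the cap is lower than the 60 seconds requested by the header.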
Phusion Passenger supports out-of-band garbage collection for Ruby apps. With this feature enabled, Phusion Passenger can run the garbage collector in between requests, so that the garbage collector doesn't delay the app as much. Please refer to the Users Guide for more information about this feature.
In certain situations, using the builtin HTTP engine in Passenger Standalone may yield some performance benefits because it skips a layer of processing.
Passenger normally works by integrating into Nginx or Apache. As described in the Design & Architecture document, requests are first handled by Nginx or Apache, and then forwarded to the Passenger core process and on to the application process. This architecture provides various benefits, such as security benefits (Nginx and Apache's HTTP connection handling routines are thoroughly battle-tested and secure) and feature benefits (e.g. Gzip compression, superb static file handling).
This is even true if you use the Standalone mode. Although it acts standalone, it is implemented under the hood by running Passenger in a builtin Nginx engine.
However, the fact that all requests go through Nginx or Apache means that there is a slight overhead, which can be avoided. This overhead is small (much smaller than typical application and network overhead), and using Nginx or Apache is very useful, but in certain special situations it may be beneficial to skip this layer.
Nginx and Apache can be removed by using Passenger's builtin HTTP engine. By using this engine, Passenger will listen directly on a socket for HTTP requests, without using Nginx or Apache.
This builtin HTTP engine can be accessed by starting Passenger Standalone using the --engine=builtin parameter, like this:
passenger start --engine=builtin
Note that the builtin HTTP engine has fewer features than the Nginx engine, by design. For example, the builtin HTTP engine does not support serving static files, nor does it support gzip compression. Thus, we recommend using the Nginx engine in most situations, unless you have special needs such as those described above.
We do not recommend ab because it's slow and buggy.
We do not recommend siege and httperf because they cannot utilize multiple CPU cores.
builtin engine. This is the default.