How To Optimize Nginx Configuration for HTTP/2 TLS (SSL)

TL;DR

HTTP/2 over TLS with nginx is already a reality; how can we get the best performance out of it? Check the example configuration below.

Introduction

We all know that HTTP/2 is here, and although it doesn’t mandate the use of TLS, the major browsers have already taken their side (a.k.a. only supporting HTTP/2 over TLS).

Support for HTTP/2 was released with nginx 1.9.5 (except for “Server Push”). But isn’t HTTPS a lot slower than good old HTTP? That’s not easy to answer, but we can fine-tune nginx to do much better than the default configuration.

I really believe the biggest fight is against latency, not CPU load; the tips you’ll see here are mostly about reducing RTT in order to decrease latency.

[image: TLS/SSL does not need much CPU]

Before we move on to the practical tips, let’s review the simple tasks you must do first:

  1. upgrade to the latest kernel (3.7+)
  2. upgrade to the latest openssl (1.0.1j)
  3. upgrade to the latest nginx (1.9.5)

These upgrades alone will get you a lot of improvements, but let’s move on to the optimization tips:

TLS session resumption

What?

When you want to use HTTPS, your browser needs to negotiate the session (certificate, cipher, hash algorithm, TLS version, keys…). In a very simplistic way, it follows these steps:

  1. Establish a TCP connection (SYN, SYN/ACK, ACK)
  2. Negotiate and establish the TLS session

When you leave the site and come back later, the browser will need to renegotiate the session. TLS session resumption is the technique to partially skip this negotiation by persisting the session for later usage.

The diagram on the left represents an oversimplified version of a full TLS handshake (skipping the TCP handshake), and on the right you can see how TLS resumption works; the point is to skip an RTT.

[images: full TLS negotiation vs. TLS negotiation with resumption]

Why?

If we skip part of the session negotiation, we’ll deliver content faster.

How?

There are two ways of solving this: saving the TLS session on the server (session cache) or, preferably, on the client (session ticket).

session cache

ssl_session_cache   shared:SSL:10m;
ssl_session_timeout 1h;

In this case, when the client tries to reconnect, the server will try to recover the persisted session, partially skipping the negotiation. With this 10 MB shared cache (roughly 4,000 sessions per megabyte, so about 10 x 4,000 sessions), the sessions will be valid for 1 hour.

However, there are problems with this approach:

  1. sessions are stored on the server;
  2. the “shared” cache is kept per server, so multiple nginx instances will not share the same sessions;

For the second problem, the great project openresty is about to release a new feature (ssl_session_store_by_lua) which will enable us to save these sessions in a “central” repository (like redis).

session ticket

# $> openssl rand 48 > file.key
ssl_session_tickets on; 
ssl_session_ticket_key file.key;

In this case, the server creates a ticket and sends it to the client; when the client tries to connect again, it presents the ticket and the server simply resumes the session.

Nginx comes with session tickets enabled by default, but if you deploy your application on more than one box (bare metal, cloud, virtual machines, containers…) you’ll also need to specify the same key (used to create the tickets) on every box, and you should rotate this key often.
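
As a quick sketch (the file names are placeholders, not from the original setup): nginx accepts more than one ssl_session_ticket_key, the first one is used to encrypt new tickets and the others are only used to decrypt older ones, which allows rotating the key without breaking recent sessions.

# generate a fresh key on one box and copy it to all the others
# $> openssl rand 80 > current.key
ssl_session_tickets on;
ssl_session_ticket_key /etc/nginx/tls/current.key;   # encrypts new tickets
ssl_session_ticket_key /etc/nginx/tls/previous.key;  # still decrypts tickets issued with the old key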

Although this approach is much better than the session cache, not all browsers support it, so you might need to offer both solutions.
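
If you want to check that resumption (cache or tickets) is actually working, openssl’s s_client is a handy sanity check; the host below is just a placeholder.

# "-reconnect" does a full handshake and then reconnects 5 times reusing the session;
# look for "Reused" instead of "New" in the connection summary lines
$ openssl s_client -connect www.example.com:443 -reconnect < /dev/null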

TLS false start

What?

How about getting the same benefit (skipping an RTT) as with TLS resumption, but on the browser’s very first negotiation with the server?! This is possible by using and enforcing forward secrecy.

Instead of waiting for the last handshake step from the server, the browser already sends the data (the request) and the server replies with its data (the response). This technique is known as TLS False Start.

[images: full TLS negotiation vs. TLS False Start]

Why?

Less RTT means faster site/video/image/data to final users.

How?

This is possible because browsers only allow False Start when the server offers forward-secrecy ciphers, so we instruct nginx to prefer exactly those ciphers.

ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_prefer_server_ciphers on;
ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:ECDHE-RSA-DES-CBC3-SHA:ECDHE-ECDSA-DES-CBC3-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA';

OCSP stapling and certificate chain

What?

Creating trust is a hard task, and part of TLS’s responsibility is to enforce it. In order to establish this trust, the browser (or your OS) needs at least one point it already trusts.

In a very simplistic way, your browser believes that http://www.example.com is who it claims to be based on a chain of trust; it checks by:

  1. looking at its certificate and then checking whether its signature is valid (checking every certificate in the chain up to the ROOT)
  2. checking that the certificate has not been revoked, either by searching for it in a CRL (certificate revocation list) or by issuing a new OCSP (Online Certificate Status Protocol) request

[image: the TLS certificate chain]

Both steps might force your browser to do more RTTs: if your server doesn’t provide the intermediate certificates, the browser will need to download them, and it might even issue an OCSP request (requiring more RTTs: DNS, TCP handshake, TLS handshake).

Why?

Less RTT means faster site/video/image/data delivery to final users, again.

How?

You can concatenate your certificate with its chain (except the ROOT; it’s not necessary, and in fact the browser won’t trust a ROOT sent by the server), thereby avoiding extra RTTs to download the missing certificates.

$ cat mysite.cert ca1.cert > full.cert
...
ssl_certificate /path/to/full.cert;

You can set up nginx to staple the OCSP response on your server (this “staple” is digitally signed, which makes it possible for your browser to check its authenticity), so the browser avoids the extra RTT of an OCSP request.

ssl_stapling on; 
ssl_stapling_verify on; 
ssl_trusted_certificate /path/to/cas.pem;
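
To confirm the staple is actually being sent, you can ask openssl to request the certificate status during the handshake (the host is a placeholder). Keep in mind that nginx fetches the OCSP response lazily, so the very first connection after a restart may still show no response.

$ openssl s_client -connect www.example.com:443 -status < /dev/null 2>&1 | grep -i "ocsp"
# "OCSP Response Status: successful" -> the staple is there
# "OCSP response: no response sent"  -> no stapling (yet)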

TLS record size optimization

What?

Although a TCP packet can in theory carry up to 64K, a new TCP connection starts with a much smaller congestion window (commonly about 10 segments, roughly 14K).

Each TLS record can hold at most 16K (which is the default size for nginx); add the TCP and IP headers, and a full 16K record no longer fits in that initial window, so the server might need 2 RTTs to serve the first bytes. And that’s not cool.

TCP is great, but it has limitations; it is not ideal for every kind of application, and there are even efforts like “quic” to make the web faster by experimenting with UDP instead of TCP.

Since we’ve already reached our speed limit, the speed of light (who knows what “quantum” can do), what’s left is to avoid extra RTTs.

*you can’t use QUIC on nginx yet.

Why?

Less RTT means faster site/video/image/data to final users. 🙂 Again!!!

How?

There is a trade-off here: you can either choose throughput (TLS record size at the maximum) or latency (a small record size). It would be great if nginx offered an adaptive option: start small (4K, to speed up the first bytes) and, after 1 minute or 1MB, increase it to 16K.

ssl_buffer_size 16k;   # for throughput (e.g. video applications)
# ssl_buffer_size 4k;  # for quick first-byte delivery
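
Until something adaptive exists, a hedged workaround is to pick the buffer size per virtual host, since ssl_buffer_size is valid at both http and server level (the host names below are only illustrative; certificates omitted for brevity):

server {
  listen 443 ssl http2;
  server_name video.example.com;   # big objects: favor throughput
  ssl_buffer_size 16k;
}

server {
  listen 443 ssl http2;
  server_name www.example.com;     # small pages: favor time to first byte
  ssl_buffer_size 4k;
}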

HSTS (HTTP Strict Transport Security)

What?

HSTS “converts” your site into strict HTTPS-only: it eliminates unnecessary HTTP-to-HTTPS redirects by shifting that responsibility to the client, and most browsers support it. Even if you forget to change http to https, the browser will do it for you.

Why?

Redirects mean more RTTs; yeah, I know it’s getting repetitive, but it’s all about reducing latency.

How?

A simple HTTP header instructs the browser:

add_header Strict-Transport-Security "max-age=31536000; includeSubdomains"; 
#'max-age' specifies (in seconds) how long the browser should follow this rule.

Why doesn’t Chrome show/accept HTTP/2?

Users of the Google Chrome web browser are seeing some sites that they previously accessed over HTTP/2 falling back to HTTP/1. You can check the what, why and how at the nginx site.
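
The short version: Chrome dropped NPN in favor of ALPN, and ALPN on the server side requires nginx to be built against OpenSSL 1.0.2+. Assuming your local openssl CLI is also 1.0.2 or newer, you can check what your server negotiates (the host is a placeholder):

$ openssl s_client -connect www.example.com:443 -alpn h2 < /dev/null 2>/dev/null | grep "ALPN"
# "ALPN protocol: h2" -> HTTP/2 will be used; anything else means clients fall back to HTTP/1.x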

Summary

It’s all about making the web faster by avoiding RTTs (later I’ll post tips specific to HTTP/2), so here’s a checklist:

  • Upgrade to the latest: kernel, openssl and nginx.
  • Use TLS resumption and TLS false start
  • `cat` your certificate with the intermediates
  • Think about the best size (it’s hard) for your TLS records
  • Enforce HSTS 😉

Here’s the full config example:

# command to generate dhparams.pem
# openssl dhparam -out /etc/nginx/conf.d/dhparams.pem 2048

limit_conn_zone $binary_remote_addr zone=conn_limit_per_ip:10m;
limit_req_zone $binary_remote_addr zone=req_limit_per_ip:10m rate=5r/s;
limit_req_status 444;
limit_conn_status 503;

proxy_cache_path /var/lib/nginx/proxy levels=1:2 keys_zone=backcache:8m max_size=50m;
proxy_cache_key "$scheme$request_method$host$request_uri$is_args$args";
proxy_cache_valid 404 1m;

upstream app_server {
  server unix:/tmp/unicorn.myserver.sock fail_timeout=0;
}

server {
  listen 80;
  server_name *.example.com;
  limit_conn conn_limit_per_ip 10;
  limit_req zone=req_limit_per_ip burst=10 nodelay;
  return 301 https://$host$request_uri$is_args$args;
}

server {
  listen 443 ssl http2;
  server_name _;

  limit_conn conn_limit_per_ip 10;
  limit_req zone=req_limit_per_ip burst=10 nodelay;

  ssl on;

  ssl_stapling on;
  ssl_stapling_verify on;
  ssl_trusted_certificate /etc/nginx/conf.d/ca.pem;

  ssl_certificate /etc/nginx/conf.d/ssl-unified.crt;
  ssl_certificate_key /etc/nginx/conf.d/private.key;
  ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
  ssl_ciphers 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA';
  ssl_dhparam /etc/nginx/conf.d/dhparams.pem;
  ssl_prefer_server_ciphers on;
  ssl_session_cache shared:SSL:10m;
  ssl_session_timeout 10m;

  root /home/deployer/apps/example.com/current/public;

  gzip_static on;
  gzip_http_version   1.1;
  gzip_proxied        expired no-cache no-store private auth;
  gzip_disable        "MSIE [1-6]\.";
  gzip_vary           on;

  client_body_buffer_size 8K;
  client_max_body_size 20m;
  client_body_timeout 10s;
  client_header_buffer_size 1k;
  large_client_header_buffers 2 16k;
  client_header_timeout 5s;

  add_header Strict-Transport-Security "max-age=31536000; includeSubdomains"; 

  keepalive_timeout 40;

  location ~ \.(aspx|php|jsp|cgi)$ {
    return 404;
  }

  location ~* ^/assets/ {
    root /home/deployer/apps/example.com/current/public;
    # Per RFC2616 - 1 year maximum expiry
    # http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
    expires 1y;
    add_header Cache-Control public;
    access_log  off;
    log_not_found off;

    # Some browsers still send conditional-GET requests if there's a
    # Last-Modified header or an ETag header even if they haven't
    # reached the expiry date sent in the Expires header.
    add_header Last-Modified "";
    add_header ETag "";
    break;
  }

  try_files $uri $uri/index.html $uri.html @app;

  location @app {
    proxy_set_header X-Url-Scheme $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    # enable this if you forward HTTPS traffic to unicorn,
    # this helps Rack set the proper URL scheme for doing redirects:
    proxy_set_header X-Forwarded-Proto $scheme;

    proxy_set_header Host $host;
    proxy_redirect off;
    proxy_pass http://app_server;
  }

  error_page 500 502 503 504 /500.html;
  location = /500.html {
    root /home/deployer/apps/example.com/current/public;
  }
}

With this configuration I was able to get an A+ at SSL Labs. It’s a useful tool where you can check what you need to do to make your SSL site better; it gives you server-specific tips (apache, nginx, IIS).
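
Besides SSL Labs, and assuming a curl built with HTTP/2 support (nghttp2), here is a quick local check that the protocol negotiation itself works (the host is a placeholder):

$ curl -sI --http2 https://www.example.com | head -1
# HTTP/2 200          -> h2 was negotiated
# HTTP/1.1 200 OK     -> the server (or curl) fell back to HTTP/1.1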

[image: SSL Labs A+ rating]

FIFA 2014 World Cup live stream architecture

We were given the task of streaming the 2014 FIFA World Cup, and I think this was an experience worth sharing. This is a quick overview of the architecture, the components, the pain, the learning, the open source and so on.

The numbers

  • GER 7×1 BRA (yeah, we’re not proud of it)
  • 0.5M simultaneous users @ a single game – ARG x SUI
  • 580Gbps @ a single game – ARG x SUI
  • =~ 1600 watched years @ the whole event

The core overview

The project was to receive an input stream, generate an HLS output stream for hundreds of thousands of viewers and provide a great experience to final users:

  1. Fetch the RTMP input stream
  2. Generate HLS and send it to Cassandra
  3. Fetch binary and meta data from Cassandra and rebuild the HLS playlists with Nginx+lua
  4. Serve and cache the live content in a scalable way
  5. Design and implement the player

If you want to understand why we chose HLS, check this presentation (pt-BR only). Tip: sometimes we need to rebuild things from scratch.

The input

The live stream arrives at our servers as RTMP, and we were using EvoStream (now we’re moving to nginx-rtmp) to receive this input and generate HLS output into a known folder. Then we have some Python daemons, running on the same machine, watching this folder, parsing the m3u8 files and posting the data to Cassandra.
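
For reference, here is a minimal sketch of what the nginx-rtmp piece looks like; the application name and paths are placeholders, not our production config.

rtmp {
  server {
    listen 1935;                 # default RTMP port
    application live {
      live on;
      hls on;                    # generate HLS from the incoming RTMP stream
      hls_path /data/hls;        # the "known folder" watched by the python daemons
      hls_fragment 5s;
      hls_playlist_length 60s;
    }
  }
}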

To watch file modifications and be notified of these events, we first tried watchdog, but for some reason we weren’t able to make it work as fast as we expected, so we switched to pyinotify.

Another challenge we had to overcome was making the Python program scale across CPU cores; we ended up creating multiple Python processes and using async execution.

tip: maybe the best language / tool is in another castle.

The storage

We were previously using Redis to store the live stream data, but we thought Cassandra was needed to offer DVR functionality easily (although we still use Redis a lot). Cassandra response time was increasing with load to the point where clients started to time out and video playback completely stopped.

We were using it as a queue, which turns out to be an anti-pattern. We then denormalized our data, changed to LeveledCompactionStrategy and set durable_writes to false, since we could treat our live stream as ephemeral data.

Finally, and most importantly, since we knew the maximum size a playlist could have, we could specify the start column (filtering with id > minTimeuuid(now – playlist_duration)). This really mitigated the effect of tombstones on reads. After these changes, we were able to achieve latency on the order of 10ms at the 99th percentile.

tip: limit your queries + denormalize your data + send instrumentation data to graphite + use SSD.

The output

With all the data and metadata we could build the HLS manifest and serve the video chunks. The only thing we struggled with was that we didn’t want to add an extra server just to fetch and build the manifests.

Since we had already invested a lot of effort into Nginx+Lua, we thought it could be possible to use Lua to fetch and build the manifest. It was a matter of building a Lua driver for Cassandra and using it. One good thing about this approach (rebuilding the manifest) was that in the end we realized we were almost ready to serve DASH.
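
A hedged sketch of the idea (the location, regex and script name are illustrative): the playlist URL is handled inside nginx itself, and a Lua script queries Cassandra and prints the rebuilt manifest.

location ~ ^/live/(?<stream>[^/]+)/playlist\.m3u8$ {
  default_type application/vnd.apple.mpegurl;
  # the script reads $stream, fetches segments/metadata from Cassandra
  # and writes the manifest out with ngx.say()
  content_by_lua_file /etc/nginx/lua/build_playlist.lua;
}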

tip: test your lua scripts + check the lua global vars + double check your caching config

The player

In order to provide a better experience, we chose to build Clappr, an extensible open-source HTML5 video player. With Clappr – and a few custom extensions like PiP (Picture In Picture) and Multi-angle replays – we were able to deliver a great experience to our users.

tip: open source it from day 0 + follow the flow: issue -> commit FIX #123

The sauron

To keep an eye on all these systems, we built a monitoring dashboard using mostly open-source projects: logstash, elasticsearch, graphite, grafana, kibana, seyren, angular, mongo, redis, rails and many others.

tip: use SSD for graphite and elasticsearch

The bonus round

Although we didn’t open source the entire solution, you can check out most of its pieces:


Discussion / QA @ HN

Tips for performance in your web sites

HTTP

HTTP is a networking protocol for distributed, collaborative, hypermedia information systems and it is also the foundation of data communication for the World Wide Web.

How does it work?

You send (using a browser, for example) a request to the server: “Hey Internet, give me the Ruby overview web page.”

 

GET /2011/07/11/ruby-overview.html HTTP/1.1
Host: leandromoreira.com.br
User-Agent: Mozilla/5.0 (s.o.) Gecko/20100101 Firefox/5.0

And the server can answer you: “Okay!”

HTTP/1.1 200 OK
Content-Type: text/html
Last-Modified: Fri, 15 Jul 2011 00:23:15 GMT
Content-Length: 896

blah blah content blah blah ruby blah blah

As you can see, HTTP is a protocol where you request resources from the server using a set of fields. This conversation between you and the server can result in further requests. Using some HTTP fields, you can request compressed data from the server (using less network bandwidth), and the server can tell you how long you may keep components in your cache.

The 14 golden rules

Steve Souders wrote the amazing book High Performance Web Sites (based on his research at Yahoo!) to help us build more scalable and faster web sites. The book presents fourteen rules to follow in practice to achieve a fast web site, along with examples from real web sites. I certainly recommend this book. (Thanks to Guilherme Motta for the recommendation.)

Rule #01 – Make fewer HTTP requests

Yeah, this rule might be obvious, but the possible solutions are not so obvious:

  • Use image maps and CSS sprites instead of multiple images.
  • Hard core -> sometimes you can use inline images (it doesn’t work with old IE).
  • Combine & minify your files: three JS scripts into only one (with whitespace removed, etc.) and nine stylesheets into one. In fact, combining everything into one file breaks our beloved OO decoupling rule. I think this combine-and-minify step should happen at deployment and delivery time, which leaves us free to create cohesive and decoupled scripts and stylesheets.

Rule #02 – Use Content Delivery Networks

I’ve worked at a company where we had some stylesheets and JS shared between projects. Our approach was that each project copied the files into its own workspace. This led us to the hell of outdated files.

Thinking in terms of caching, if we had these files on a single known server, requests across all our web sites would take advantage of the same cache.

Looking at the big picture (imagine we’re handling a huge web site), we could also take advantage of proximity to the customer: if our components are close to our clients, download time decreases.

Rule #03 – Add an Expires Header

In the HTTP conversation, the server can tell a client that a certain component may be used from its local cache for a certain time, using the HTTP field Expires. You request a component from the server and it answers with that component plus a validity time for it. The setup for this is done on the server side. (See also max-age and Cache-Control, which overcome a limitation of Expires: Expires requires an exact date, which causes clock-synchronization issues, whereas Cache-Control lets you set the time in seconds!)

People usually don’t set this because they fear change. For example: I publish a component (the company’s logo) with an expiry of ten days, but what if I need to change it before then?! Well, you can start writing your components with a revision number in the name: company_logo_1.0.png is cached, but your newer version company_logo_1.1.png isn’t.
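
Since this whole series ends up on nginx anyway, here is a minimal sketch of the rule in nginx terms (the ten-day value and the extensions are just examples):

location ~* \.(png|jpg|gif|css|js)$ {
  expires 10d;                   # sends Expires plus Cache-Control: max-age=864000
  add_header Cache-Control public;
}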

Rule #04 – Gzip ’em all

Once again, you can inform the server that you understand and accept compressed files (Accept-Encoding: gzip), and the server may answer with compressed data (Content-Encoding: gzip). Compression saves on average about 66% of your network traffic; it’s huge and worth doing.
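
Again in nginx terms, a minimal sketch (the compression level and MIME types are illustrative; text/html is always compressed when gzip is on):

gzip on;
gzip_comp_level 5;               # 1 (fastest) .. 9 (smallest); 5 is a common middle ground
gzip_min_length 256;             # don't bother compressing tiny responses
gzip_types text/css application/javascript application/json image/svg+xml;
gzip_vary on;                    # adds "Vary: Accept-Encoding" for caches and proxies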

Rule #05 – Stylesheets at top

The history is long (progressive rendering, how browsers render, the blank screen you get when CSS is at the bottom), so just follow the rule (and read the book to understand the whole story)! BTW, testing showed that including CSS with LINK instead of @import is better.

Rule #06 – Scripts at bottom

Again, a long history (how browsers handle scripts, which can EVEN BLOCK parallel downloads when placed in the middle of your HTML); just follow the rule!

Rule #07 – Don’t use CSS expression (okay, avoid)

This rule brought a new concept to me: I didn’t know we could mix JavaScript into CSS rules this way. For example:

width: expression( document.body.clientWidth < 600 ? "600px" : "auto" );

Rule #08 – Make JS & CSS external

Inline vs. external -> in raw terms, inline is faster. But don’t forget the other rules: compression, CDN and caching. In general, for real-world projects, external files win on performance (mainly for users with a primed cache, i.e., components already cached because they’ve visited the site before).

Rule #09 – Reduce DNS lookups

The round trip a request makes to resolve a host name can be hugely time-consuming, so reducing DNS lookups (and letting DNS responses be cached) will greatly improve your users’ first visit.

Rule #10 – Minify Javascript

I already cited this, but it’s a valid rule on its own: imagine the bytes you can save by minifying your JS.


function myFunction(parameter0, parameter1) {
  window.title = parameter0 + parameter1;
}

Works exactly the same as:


function myFunction(_0,_1){window.title=_0+_1;}

There are a bunch of minifiers on the Internet, and even obfuscators, but I don’t think obfuscation is necessary and most obfuscators can introduce bugs. There are CSS minifiers too: write 0 instead of 0px, #606 instead of #660066.

Rule #14 – Make ajax cacheable

This is maybe the hardest rule to explain and apply! So it’s better to read about it here.

PS: I skipped rules 11 (avoid redirects), 12 (remove duplicate scripts) and 13 (configure ETags) just because they are the most widely known, except for ETags, where the usual tip is simply: avoid ETags.

Oh yes, you can also see more high-performance rules at Yahoo! and at Google.

Useful tools to help measure your site

Today we fortunately have tools to help us identify and fix performance issues on our web sites. You can simply grab one of the tools below (or both) and run it against your site. They give you a complete report on the pain points and the things you can do to improve performance. YSlow uses the fourteen rules as its basis, and Speed Tracer is a Google tool, which says it all.

YSlow

YSlow analyzes web pages and suggests ways to improve their performance based on a set of rules for high performance web pages. It can be installed as a browser plugin.

Speed Tracer

Speed Tracer is a tool to help you identify and fix performance problems in your web applications. It visualizes metrics that are taken from low-level instrumentation points inside of the browser and analyzes them as your application runs. Speed Tracer is available as a Chrome extension and works on all platforms where extensions are currently supported (Windows and Linux).