We were given the task of streaming the 2014 FIFA World Cup, and I think the experience is worth sharing. This is a quick overview of the architecture, the components, the pain, the lessons learned, the open source projects and more.
- GER 7×1 BRA (yeah, we’re not proud of it)
- 0.5M simultaneous users @ a single game – ARG x SUI
- 580Gbps @ a single game – ARG x SUI
- ≈ 1,600 years of video watched @ the whole event
The core overview
The project was to receive an input stream, generate an HLS output stream for hundreds of thousands of viewers and provide a great experience for end users:
- Fetch the RTMP input stream
- Generate HLS and send it to Cassandra
- Fetch binary and meta data from Cassandra and rebuild the HLS playlists with Nginx+lua
- Serve and cache the live content in a scalable way
- Design and implement the player
If you want to understand why we chose HLS, check this presentation (available only in pt-BR).
tip: sometimes we need to rebuild some things from scratch.
The live stream arrives at our servers as RTMP. We were using EvoStream (we’re now moving to nginx-rtmp) to receive this input and generate HLS output in a known folder. Python daemons running on the same machine watch this folder, parse the m3u8 playlists and post the data to Cassandra.
To be notified of file modifications, we first tried watchdog, but for some reason we couldn’t make it work as fast as we expected, so we switched to pyinotify.
Another challenge we had to overcome was making the Python program scale across multiple CPU cores; we ended up creating multiple Python processes and using async execution.
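To make the parsing step concrete, here is a minimal Python sketch of reading an HLS media playlist and extracting the per-segment data a daemon like this might post to storage. The function name `parse_media_playlist`, the dictionary keys and the sample playlist are illustrative assumptions, not the code that ran in production.

```python
SAMPLE_PLAYLIST = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:42
#EXTINF:10.0,
segment_42.ts
#EXTINF:9.5,
segment_43.ts
"""


def parse_media_playlist(text):
    """Return (target_duration, media_sequence, segments) for an HLS media playlist.

    Assumes the header tags appear before the first segment, as in typical
    live playlists. Only the tags needed for this sketch are handled.
    """
    target_duration = None
    media_sequence = 0
    segments = []
    pending_duration = None  # duration from the last #EXTINF, waiting for its URI

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#EXT-X-TARGETDURATION:"):
            target_duration = int(line.split(":", 1)[1])
        elif line.startswith("#EXT-X-MEDIA-SEQUENCE:"):
            media_sequence = int(line.split(":", 1)[1])
        elif line.startswith("#EXTINF:"):
            # Format is "#EXTINF:<duration>,[<title>]"
            pending_duration = float(line.split(":", 1)[1].split(",", 1)[0])
        elif line and not line.startswith("#") and pending_duration is not None:
            # A non-tag line following #EXTINF is the segment URI.
            segments.append({
                "sequence": media_sequence + len(segments),
                "duration": pending_duration,
                "uri": line,
            })
            pending_duration = None

    return target_duration, media_sequence, segments


target, first_sequence, parsed = parse_media_playlist(SAMPLE_PLAYLIST)
```

Each entry in `parsed` carries what is needed to rebuild the playlist later: the sequence number, the duration and the chunk name.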
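One way to combine multiple processes with async execution, sketched under assumptions: each process owns a fixed shard of the streams and runs its own asyncio event loop. The names `shard_for`, `handle_stream`, `run_shard`, `worker` and `start_workers` are hypothetical, and the real daemons may have split the work differently.

```python
import asyncio
import zlib
from multiprocessing import Process


def shard_for(stream_name, workers):
    """Map a stream to a worker index.

    crc32 is used instead of hash() because str hashes are randomized
    per process, and every worker must compute the same mapping.
    """
    return zlib.crc32(stream_name.encode("utf-8")) % workers


async def handle_stream(stream_name):
    # Placeholder for the real work (parse the playlist, post to storage).
    await asyncio.sleep(0)
    return f"processed {stream_name}"


async def run_shard(streams):
    # Handle all streams of this shard concurrently inside one event loop.
    return await asyncio.gather(*(handle_stream(s) for s in streams))


def worker(index, workers, streams):
    mine = [s for s in streams if shard_for(s, workers) == index]
    asyncio.run(run_shard(mine))


def start_workers(streams, workers):
    """Start one process per shard and wait for all of them to finish."""
    procs = [Process(target=worker, args=(i, workers, streams)) for i in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Call `start_workers(...)` from under an `if __name__ == "__main__":` guard so that it also works on platforms where child processes re-import the main module.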
tip: maybe the best language / tool is in another castle.
We were previously using Redis to store the live stream data, but we thought Cassandra was needed to offer DVR functionality easily (although we still use Redis a lot). Cassandra’s response time kept increasing with load, up to the point where clients started to time out and video playback stopped completely.
We were using it like a queue, which turns out to be an anti-pattern. We then denormalized our data, changed to LeveledCompactionStrategy and set durable_writes to false, since we could treat our live stream as ephemeral data.
Finally, and most importantly, since we knew the maximum size a playlist could have, we could specify the start column (filtering with id > minTimeuuid(now - playlist_duration)). This really mitigated the effect of tombstones on reads. After these changes, we were able to achieve a latency on the order of 10ms at the 99th percentile.
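The bounded read can be sketched as a small Python helper that builds the CQL text. `minTimeuuid` is a standard CQL function; the table name `segments`, the columns `stream`, `id`, `uri` and `duration`, and the helper name `playlist_window_query` are assumptions made for illustration. The `%s` placeholder for the stream follows the style the DataStax Python driver uses for simple statements, but no driver call is made here.

```python
from datetime import datetime, timedelta, timezone


def playlist_window_query(playlist_duration_s, max_segments, now=None):
    """Build a CQL query that only reads rows newer than the playlist window.

    `now` must be a timezone-aware datetime; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    lower_bound = (now - timedelta(seconds=playlist_duration_s)).astimezone(timezone.utc)
    # Timestamp literal in a format CQL accepts, e.g. '2014-07-01 15:59:00+0000'.
    literal = lower_bound.strftime("%Y-%m-%d %H:%M:%S+0000")
    return (
        "SELECT id, uri, duration FROM segments "  # hypothetical table and columns
        "WHERE stream = %s "
        f"AND id > minTimeuuid('{literal}') "
        f"LIMIT {int(max_segments)}"
    )


kickoff = datetime(2014, 7, 1, 16, 0, 0, tzinfo=timezone.utc)
query = playlist_window_query(playlist_duration_s=60, max_segments=20, now=kickoff)
```

Because the lower bound moves forward with the clock, the read skips over older rows and their tombstones instead of scanning them, and the `LIMIT` caps the result at the largest playlist that can exist.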
tip: limit your queries + denormalize your data + send instrumentation data to graphite + use SSD.
With all the data and metadata in place, we could build the HLS manifest and serve the video chunks. The only thing we were struggling with was that we didn’t want to add an extra server to fetch and build the manifests.
Since we had already invested a lot of effort in Nginx+Lua, we thought it might be possible to use Lua to fetch and build the manifest. It was a matter of building a Lua driver for Cassandra and using it. One good thing about this approach (rebuilding the manifest) was that in the end we realized we were almost ready to serve DASH.
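The rebuilding logic itself is small. The production version ran as Lua inside Nginx; the sketch below uses Python only to illustrate the idea, and the function name `build_media_playlist` and the row shape (dicts with `duration` and `uri`) are assumptions.

```python
import math


def build_media_playlist(segments, media_sequence):
    """Render a live HLS media playlist from stored segment rows.

    `segments` is a list of dicts with "duration" (seconds) and "uri",
    ordered oldest first. No #EXT-X-ENDLIST is written because the
    stream is still live.
    """
    if not segments:
        raise ValueError("a media playlist needs at least one segment")

    # The target duration must not be smaller than any segment duration;
    # rounding up is a safe choice.
    target = max(1, math.ceil(max(s["duration"] for s in segments)))

    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target}",
        f"#EXT-X-MEDIA-SEQUENCE:{media_sequence}",
    ]
    for s in segments:
        lines.append(f"#EXTINF:{s['duration']:.3f},")
        lines.append(s["uri"])
    return "\n".join(lines) + "\n"


rows = [
    {"duration": 10.0, "uri": "segment_42.ts"},
    {"duration": 9.5, "uri": "segment_43.ts"},
]
manifest = build_media_playlist(rows, media_sequence=42)
```

Since the manifest is generated from rows rather than stored as a file, emitting another format from the same rows mostly means writing another renderer.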
tip: test your lua scripts + check the lua global vars + double check your caching config
In order to provide a better experience, we chose to build Clappr, an extensible open-source HTML5 video player. With Clappr – and a few custom extensions like PiP (Picture In Picture) and Multi-angle replays – we were able to deliver a great experience to our users.
tip: open source it from day 0 + follow the flow issue -> commit FIX#123
To keep an eye on all these systems, we built a monitoring dashboard using mostly open source projects like: logstash, elasticsearch, graphite, grafana, kibana, seyren, angular, mongo, redis, rails and many others.
tip: use SSD for graphite and elasticsearch
The bonus round
Although we didn’t open source the entire solution, you can check out most of its pieces.
Discussion / QA @ HN