Among the myriad of tech choices we’ve made over the last couple of weeks, the choice of a web framework is probably one of the most important (behind a client development framework). There are a few criteria we can imagine using in making such a selection – ergonomics, community/hireability, and performance – the last of which is the focus of this post.
There are quite a few benchmarks out there for the current generation of web technologies, but I’ve found they generally shy away from putting a real workload behind the benchmark – or if they do, they tend to focus on trivial or computationally intensive workloads rather than workloads that mirror real consumption. This is likely because workloads tend to align to specific verticals (gaming, photo sharing, etc.) while developers of web frameworks tend to target generality and flexibility. That’s not really a bad thing, but as a software developer you want to know whether a framework is going to do your thing efficiently.
So, in this post we examine how well the current generation of web frameworks do with workloads that a social game is likely to encounter.
Obviously there are many flavors of games out there – and many ways to write the backend of those games – so we’ll limit the discussion to blob-based, sandbox games that are mostly single-player in focus. The workload for these games is generally characterized by:
- Fetching a document for a player from authoritative storage (such as memcached, Couchbase or, in our case, Redis)
- Decompressing that document using some strategy – like Gzip or LZO
- De-serializing that document from its persistent representation – JSON decoding or using some other scheme
- Doing some amount of game logic on that player state – This is usually mildly CPU intensive
- Re-serializing player state into its persistent representation – JSON encoding or similar
- Compressing the resulting document – Again, with Gzip or LZO
- Sending that document back to persistent storage.
The entire operation is usually controlled by some distributed synchronization scheme – such as CAS or surrogate memcached keys – and these can indeed cause additional load, but this is invariant to the web framework chosen, so we’ll leave it out of this discussion.
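For anyone unfamiliar with the CAS idea, here is a minimal, hypothetical sketch in Python – a dict-backed store with a version counter standing in for Redis WATCH/MULTI or memcached’s cas command. The function and key names are mine, and a real implementation needs the compare and the write to happen atomically server-side:

```python
def cas_save(store, key, expected_version, new_doc):
    """Optimistic check-and-set: write new_doc only if the document's
    version is still the one we read before running game logic.
    `store` here is a plain dict of key -> (version, doc); real systems
    push this compare into Redis (WATCH/MULTI/EXEC) or memcached (cas)
    so the check and the write are atomic."""
    current_version, _ = store.get(key, (0, None))
    if current_version != expected_version:
        return False  # someone else saved first; caller retries the tick
    store[key] = (expected_version + 1, new_doc)
    return True
```

A losing writer simply retries the whole fetch/logic/save cycle, which is where the extra load mentioned above comes from.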
Players will post transactions against their game state at a semi-regular interval while logged into the app (say, every 10-20s), and will do so by posting HTTP requests to our web server.
When profiled, this workload produces a walltime breakdown that looks like:
- 20% Fetching Records
- 20% Decompressing / Deserializing Documents
- 20% Game Logic
- 20% Compressing / Serializing Documents
- 20% Saving Records
This creates a workload that’s roughly 40% I/O bound and 60% CPU bound, and on the whole is probably a bit more I/O-rich than other web applications – still, it’s my hope that these results are at least directionally relevant to what you may have going on.
For this exercise, we’ll define two representative game workloads and then benchmark 7 common web frameworks against them.
Workload 1 – 40% I/O
This workload represents a plausible social game workload for a player logged into our application. Pseudo code looks roughly like this:
// Fetch a known gzipped JSON document from redis.
// Unzip the document
// JSON decode the document
// Do 10k FLOPs - rough simulation of game logic
// JSON encode the document
// Rezip the document
// Send it back to storage
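Filled out as runnable Python, the steps above look roughly like the following. This is just a sketch: the storage client is abstracted to anything with get/set (e.g. a redis-py connection), and the key, field names and FLOP loop are illustrative, not our actual handler.

```python
import gzip
import json

def handle_tick(store, key):
    """One Workload 1 request against a store exposing get/set
    (e.g. a redis-py connection). Key and field names are made up."""
    raw = store.get(key)                      # 1. fetch the gzipped JSON blob
    state = json.loads(gzip.decompress(raw))  # 2-3. unzip + JSON decode
    acc = 0.0                                 # 4. ~10k FLOPs of "game logic"
    for i in range(10_000):
        acc += i * 0.5
    state["last_tick_score"] = acc
    blob = gzip.compress(json.dumps(state).encode())  # 5-6. re-encode + rezip
    store.set(key, blob)                      # 7. send it back to storage
```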
The document we’ll use for the benchmark represents an elder player blob in our game – it’s pulled from another popular social web title and obfuscated, but its complexity and decode speed stay the same.
This player state is 100k uncompressed and 10k gzipped, and represents (I think) a pretty reasonable amount of state for a player.
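As a rough illustration of that compression ratio, the snippet below builds a synthetic stand-in for a repetitive JSON player document (this is not the real blob – the structure and sizes are made up):

```python
import gzip
import json
import random

# Build a synthetic, repetitive "player state" document; real blobs
# compress well for the same reason - lots of repeated keys and values.
random.seed(7)
state = {f"item_{i}": {"level": random.randint(1, 50), "equipped": False}
         for i in range(2000)}
raw = json.dumps(state).encode()
packed = gzip.compress(raw)
assert gzip.decompress(packed) == raw   # lossless round trip
print(f"{len(raw)} bytes raw, {len(packed)} bytes gzipped")
```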
Workload 2 – 80% I/O
Early in testing frameworks with Workload 1, we found that the zlib or JSON implementation available to each framework strongly dictated the performance we were getting – those libraries became the bottleneck. To provide another lens for looking at these frameworks, we came up with a second workload which – while NOT plausible for a live game server – gives a better sense of how each framework deals with an I/O-heavy workload, without being clouded by specific library implementations that can always be optimized.
The pseudo code for this workload looks something like the following:
// Fetch a document from storage
// Do 10k FLOPs
// Save the document back to storage (retaining compression)
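A matching Python sketch for this workload, with the same hedges as before (the store is anything with get/set, e.g. a redis-py connection, and the names are illustrative):

```python
def handle_tick_io(store, key):
    """One Workload 2 request: move the compressed bytes around
    without ever decoding them."""
    blob = store.get(key)        # fetch the still-compressed document
    acc = 0.0                    # ~10k FLOPs
    for i in range(10_000):
        acc += i * 0.5
    store.set(key, blob)         # save it back untouched
    return acc
```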
This produces a workload that’s something closer to 80% I/O.
We tested a total of 7 web frameworks as part of this – a mix of older “blocking” frameworks such as PHP and Ruby, and newer async frameworks such as Node, Go and Netty.
- PHP 5.3.3 running on Prefork Apache 2.2.15
- Node.js 0.8.14 with Cluster, running with an nginx local proxy
- Netty 3.5 with Nginx local proxy
- Go 1.0.3
- Mono 3.0.1 with Mod_mono running on Prefork Apache 2.2.15
- Ruby 1.8.7 with Thin server 1.4.1 with Nginx local proxy
- JRuby 1.7.0 with Trinidad 1.4.4
I’ve tried to set these up with generally available best practices for each framework – some notes on the configuration nitty-gritty are below for anyone who’s interested.
Another gripe I’ve had with some of the other benchmarks I’ve seen posted is that they frequently tend to be run against a localhost server. While this might be useful directionally, it’s a pretty dirty environment – subject to whatever else your personal machine happens to be doing at the time (including throwing load). So for these benchmarks we’ve chosen to run three dedicated machines, a microcosm of what we anticipate running in production:
We thought about bumping the throw box up from an m1.small to a “C” series box to generate more load on the web tier – but an m1.small was more than capable of generating sufficient load (this is because the bulk of the threads are tied up in I/O anyway).
For the throw box, we’re using Apache Bench. This is actually my first time using AB (I usually write my own little scripts to bench) – but holy crap is this tool handy. For each instance of the test, we ran AB from the throw box and took a reading over the first 10,000 requests.
Again – these were run on a c1.xlarge box (8 cores) and represent a reading from the first 10,000 requests (with concurrency noted).
Workload 1 – Mean Response Time (Smaller is Better)
Workload 1 – Median Response Time (Smaller is Better)
Workload 1 – 80th Percentile Response Time (Smaller is Better)
Workload 1 – Requests per Second (Bigger is Better)
Workload 2 – Mean Response Time (Smaller is Better)
Workload 2 – Median Response Time (Smaller is Better)
Workload 2 – 80th Percentile Response Time (Smaller is Better)
Workload 2 – Requests per Second (Bigger is Better)
The raw results are also available here (they include other percentiles that may interest some folks).
So, a few conclusions:
- There’s a marked difference in RPS and response time between the new async frameworks and the sync frameworks – this becomes especially obvious as the number of concurrent requests exceeds the number of physical cores on the box. Async frameworks reach their peak RPS at concurrency levels above the core count.
- Workloads demonstrating a richer I/O workload (as a % of wall time) get larger benefits from using an async framework. Workloads with a pure CPU workload should demonstrate near identical performance on both sync and async frameworks.
- Go posts very poor scores on workload 1 and really excellent scores on workload 2 – I think this points to particularly crappy JSON and zlib implementations in Go (as of 1.0.3).
- Mono 3.0.1’s implementation of .NET 4.5 is super disappointing and really not ready for primetime (the async keyword isn’t implemented yet).
- It’s a giant mistake to assume that a platform’s standard utility libraries demonstrate good (let alone best) performance. The best JSON lib can be 10x faster than the worst.
- There appears to be an emerging trend of decoupling a web framework from its server container (e.g. having a local nginx proxy). This introduces some new complexity: managing both processes and making sure the guest web framework keeps its service and workers in a healthy state.
Below are some rough notes on how we configured each of the frameworks.
PHP
- apc.stat off
- max mem 32MB
- apache prefork
- StartServers – 8
- MinSpareServers – 5
- MaxSpareServers – 20
- ServerLimit – 256
- MaxClients – 256
- MaxRequestsPerChild – 4000
- Predis v0.8
Node
- Using the cluster module – 16 workers (2x CPU).
- Had to build Node from src – required juggling python versions.
- Single node instance for NGINX Proxy
- Using node_redis client w/ async connections. Connection formed prefork
- Noticed no performance difference w/ and w/o using hiredis (high performance C extension for node redis serialization)
Netty
- Performance was absolutely atrocious until we swapped out the JSON lib (json-simple) for Jackson’s ObjectMapper. This brought RPS from 35 to 300+ – a roughly 10x increase.
- Using the lettuce Redis client framework (the only one that seemed to support an async connection). Had to compile this from src.
- Using a singleton connection (similar to node)
- Netty’s framework is very arcane – this example took me by far the longest to implement.
Go
- Used a simple forking model w/ 2x CPU workers. No local proxy.
- Used the out-of-the-box JSON/zlib implementations – these appear to be unoptimized.
- Used go-redis library – no issues
Mono
- Had to build Mono (3.0.1) from src.
- Mod Mono Server 2.5 hasn’t been updated to run 4.5 code – tried running a local proxy to an XSP server (which can run .NET 4.5 code).
- Async keyword not implemented yet – ran into this when trying to setup an Async Httpd Handler
- The benchmark represents a fully sync implementation. Ran Mod Mono on Apache using both the prefork and worker MPMs. The worker MPM (the recommended configuration) actually yielded worse performance for this workload.
- Used ServiceStack implementation – no issues here.
Ruby
- Initially started w/ a Rails-based app – then realized WEBrick isn’t meant to be a production server.
- Tried to get a mongrel cluster set up and failed – cut over to thin server.
- Thin server is pretty good – I wound up cutting out all the Rails stuff and went with a simple shim app (I suspect Rails was likely going to add some amount of overhead).
- Set up 16 instances of thin server and configured nginx to round-robin proxy to each of them.
- Used redis-rb client with no issues
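For the curious, the nginx side of that round-robin setup looks roughly like the following. This is a sketch only – the ports and instance count shown are assumptions, not our exact config:

```nginx
upstream thin_cluster {
    # one entry per thin instance, 16 total
    server 127.0.0.1:3000;
    server 127.0.0.1:3001;
    # ...
}

server {
    listen 80;
    location / {
        proxy_pass http://thin_cluster;
    }
}
```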
JRuby
- No major issues setting up JRuby on top of an existing Ruby installation, other than understanding that the gems are managed distinctly.
- Used the Ruby configuration above, except for thin server.
- Used the Trinidad server instead of thin server here – we expect a single JVM instance to schedule threads better than the Frankenstein nginx proxy + thin server setup.
We’ve made the code available for anyone who’d like to compare results – we’re super interested to hear if anyone has a killer tweak that substantially changes one of these frameworks’ performance characteristics for these workloads.