Benchmarking Web Frameworks for Games

Background

Among the myriad tech choices we've made over the last couple of weeks, the choice of a web framework is probably one of the most important (second only to the choice of a client development framework). There are a few criteria we can imagine using in making such a selection – ergonomics, community/hireability, and performance – the last of which is the focus of this post.

There are quite a few benchmarks out there for the current generation of web technologies, but I've found they generally shy away from putting a real workload behind the benchmark – or, if they do, they focus on trivial or computationally intensive workloads rather than workloads that mirror real consumption. This is likely because workloads tend to align with specific verticals (gaming, photo sharing, etc.) while developers of web frameworks target generality and flexibility. That's not really a bad thing, but as a software developer you want to know whether a given framework will do your thing efficiently.

So, in this post we examine how well the current generation of web frameworks handles the workloads a social game is likely to encounter.

Game Workload

Obviously there are many flavors of games out there – and many ways to write their backends – so we'll limit the discussion to blob-based, sandbox games that are mostly single-player in focus. The workload for these games is generally characterized by:

  • Fetching a document for a player from authoritative storage – such as memcached, Couchbase, or in our case Redis
  • Decompressing that document using some strategy – such as gzip or LZO
  • Deserializing that document from its persistent representation – JSON decoding or some other scheme
  • Doing some amount of game logic on that player state – usually mildly CPU intensive
  • Re-serializing the player state into its persistent representation – JSON encoding or similar
  • Compressing the resulting document – again with gzip or LZO
  • Sending that document back to persistent storage

The entire operation is usually controlled by some distributed synchronization scheme – such as CAS or surrogate memcache keys – and these can indeed cause additional load, but that load is invariant to the web framework chosen, so we'll leave it out of this discussion.

Players will post transactions against their game state at a semi-regular interval while logged into the app (say every 10–20 seconds), and will do so by posting HTTP requests up to our web server.

When profiled, this workload produces a walltime breakdown that looks like:

  • 20% Fetching Records
  • 20% Decompressing / Deserializing Documents
  • 20% Game Logic
  • 20% Compressing / Serializing Documents
  • 20% Saving Records

This creates a workload that's roughly 40% I/O bound and 60% CPU bound – on the whole probably a bit more I/O-rich than other web applications – but it's my hope that these results are at least directionally relevant to whatever you may have going on :)

For this exercise, we’ll define two representative game workloads and then benchmark 7 common web frameworks against them.

Workload 1 – 40% I/O

This workload represents a plausible social game workload for a player logged into our application. The pseudocode looks roughly like this:

// Fetch a known gzipped JSON document from redis.
// Unzip the document
// JSON decode the document
// Do 10k FLOPs - rough simulation of game logic
// JSON encode the document
// Rezip the document
// Send it back to storage
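
To make that concrete, here's a minimal Go sketch of such a handler. The key name ("player:1"), the FLOP loop, and the addresses are illustrative, and I'm using the redigo Redis client here for brevity (the benchmark repo itself uses go-redis), so treat this as a sketch rather than the benchmark code:

package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"io"
	"log"
	"net/http"

	"github.com/gomodule/redigo/redis"
)

func workload1(w http.ResponseWriter, r *http.Request) {
	// One connection per request keeps the sketch simple; a real
	// server would hold a connection pool.
	conn, err := redis.Dial("tcp", "localhost:6379")
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer conn.Close()

	// Fetch a known gzipped JSON document from Redis.
	blob, err := redis.Bytes(conn.Do("GET", "player:1"))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Unzip the document.
	zr, err := gzip.NewReader(bytes.NewReader(blob))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	raw, err := io.ReadAll(zr)
	zr.Close()
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// JSON decode the document.
	var state map[string]interface{}
	if err := json.Unmarshal(raw, &state); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// ~10k FLOPs as a rough simulation of game logic.
	x := 1.0
	for i := 0; i < 10000; i++ {
		x *= 1.0000001
	}
	_ = x

	// JSON encode, re-zip, and send the document back to storage.
	out, err := json.Marshal(state)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(out)
	zw.Close()
	if _, err := conn.Do("SET", "player:1", buf.Bytes()); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write([]byte("OK"))
}

func main() {
	http.HandleFunc("/workload1", workload1)
	log.Fatal(http.ListenAndServe(":8080", nil))
}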

The document we'll use for the benchmark represents an elder player's blob in our game. It's pulled from another popular social web title and obfuscated, but its complexity and decode speed stay the same.

https://github.com/juiceboxgames/web-bench/blob/master/util/document.json

This player state is 100k uncompressed and 10k gzipped, and represents (I think) pretty reasonable state for a player.

Workload 2 – 80% I/O

Early in testing frameworks with Workload 1, we found that the zlib or JSON implementation available in each framework strongly dictated the performance we were getting – they became the bottlenecks. To provide another lens for looking at these frameworks, we came up with a second workload which – while NOT plausible for a live game server – gives a better sense of how each framework deals with an I/O-heavy workload, without being clouded by specific library implementations (which can always be optimized).

The pseudocode for this workload looks something like the following:

// Fetch a document from storage
// Do 10k FLOPs
// Save the document back to storage (retaining compression)

This produces a workload that's closer to 80% I/O.
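
Stripped down, the handler from the earlier sketch reduces to something like this (same illustrative key name and client; it slots into the same file as the Workload 1 sketch):

// workload2 round-trips the still-compressed blob, so nearly all
// of the request is spent on the two Redis round trips.
func workload2(w http.ResponseWriter, r *http.Request) {
	conn, err := redis.Dial("tcp", "localhost:6379")
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	defer conn.Close()

	// Fetch the document from storage, still gzipped.
	blob, err := redis.Bytes(conn.Do("GET", "player:1"))
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// ~10k FLOPs.
	x := 1.0
	for i := 0; i < 10000; i++ {
		x *= 1.0000001
	}
	_ = x

	// Save the document straight back, retaining compression.
	if _, err := conn.Do("SET", "player:1", blob); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write([]byte("OK"))
}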

Frameworks Tested

We tested a total of 7 web frameworks – a mix of older "blocking" frameworks such as PHP and Ruby and newer async frameworks such as Node, Go, and Netty.

  • PHP 5.3.3 running on prefork Apache 2.2.15
  • Node.js 0.8.14 with Cluster, behind a local Nginx proxy
  • Netty 3.5, behind a local Nginx proxy
  • Go 1.0.3
  • Mono 3.0.1 with mod_mono running on prefork Apache 2.2.15
  • Ruby 1.8.7 with Thin 1.4.1, behind a local Nginx proxy
  • JRuby 1.7.0 with Trinidad 1.4.4

I've tried to set these up with generally available best practices for each framework – some notes on the configuration nitty-gritty are below for anyone who's interested.

Methodology

Another gripe I've had with some of the other benchmarks I've seen posted is that they're frequently run against a localhost server. While this might be useful directionally, it's a pretty dirty environment – subject to whatever else your personal machine happens to be doing at the time (including throwing the load itself). So for these benchmarks we've chosen to run three dedicated machines, a microcosm of what we anticipate running in production:

Benchmark Server Layout

We thought about bumping the throw box up from an m1.small to a "C" series box to generate more load on the web tier, but the m1.small was more than capable of generating sufficient load (the bulk of its threads are tied up in I/O anyway).

For the throw box, we're using ApacheBench (ab). This is actually my first time using ab (I usually write my own little scripts to bench), but holy crap is this tool handy. For each instance of the test, we ran ab from the throw box and took a reading over the first 10,000 requests.
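
The invocations looked roughly like this (host, path, and the concurrency value here are illustrative – concurrency varied per run and is noted with each result):

ab -n 10000 -c 128 http://web-box/workload1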

Results

Again – these were run on a c1.xlarge box (8 cores) and represent a reading from the first 10,000 requests (with concurrency noted).

Workload 1 – Mean Response Time (Smaller is Better)

Workload 1 – Median Response Time (Smaller is Better)

Workload 1 – 80th Percentile Response Time (Smaller is Better)

Workload 1 – Requests per Second (Bigger is Better)

Workload 2 – Mean Response Time (Smaller is Better)

Workload 2 – Median Response Time (Smaller is Better)

Workload 2 – 80th Percentile Response Time (Smaller is Better)

Workload 2 – Requests per Second (Bigger is Better)

The raw results are also available here (they include other percentiles that may interest some folks):

Workload 1 – https://docs.google.com/spreadsheet/ccc?key=0Aiym6-aR-eJydHFOcExORzNEbkRONVlOVHJ1ek85dEE

Workload 2 – https://docs.google.com/spreadsheet/ccc?key=0Aiym6-aR-eJydEd3OW4zTnFyRG0zT2tpMzEySXlLd0E

Conclusions

So, a few conclusions:

  • There's a marked difference in RPS and response time between the new async frameworks and the sync frameworks – it becomes especially obvious once the number of concurrent requests exceeds the number of physical cores on the box. Async frameworks reach their peak RPS at concurrency levels above the number of concurrent workers.
  • Workloads that are richer in I/O (as a % of wall time) benefit more from an async framework. Purely CPU-bound workloads should demonstrate near-identical performance on sync and async frameworks.
  • Go posts terrible scores on Workload 1 and really excellent scores on Workload 2 – I think this points to particularly crappy JSON and zlib implementations in Go 1.0.3.
  • Mono 3.0.1's implementation of .NET 4.5 is super disappointing and really not ready for primetime (the async keyword isn't implemented yet).
  • It's a giant mistake to assume that a platform's built-in utility libraries demonstrate good (let alone best) performance. The best JSON library can be 10x faster than the worst.
  • There appears to be an emerging trend of decoupling a web framework from its server container (e.g. having a local Nginx proxy). This introduces some new complexity in managing both processes and making sure the guest framework keeps its service and workers in a healthy state.

Framework Configuration

Below are some rough notes on how we configured each of the frameworks

PHP

  • apc.stat off
  • max mem 32MB
  • apache prefork (an httpd.conf sketch follows this list)
    • StartServers – 8
    • MinSpareServers – 5
    • MaxSpareServers – 20
    • ServerLimit – 256
    • MaxClients – 256
    • MaxRequestsPerChild – 4000
  • Predis v0.8
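
For reference, those prefork settings correspond to an httpd.conf block like the following (standard Apache 2.2 directives; only the values listed above are from our setup):

<IfModule prefork.c>
    StartServers           8
    MinSpareServers        5
    MaxSpareServers       20
    ServerLimit          256
    MaxClients           256
    MaxRequestsPerChild 4000
</IfModule>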

Node

  • Using the cluster module – 16 workers (2x CPU).
  • Had to build Node from source – required juggling Python versions.
  • Single Node instance behind the local Nginx proxy.
  • Using the node_redis client w/ async connections. Connections are formed pre-fork.
  • Noticed no performance difference with and without hiredis (the high-performance C extension for node_redis serialization).

Netty

  • Performance was absolutely atrocious until we swapped out the JSON lib (json-simple) for Jackson's ObjectMapper. This brought RPS from 35 to 300+ – a roughly 10x increase.
  • Using the lettuce Redis client framework (the only one that seemed to support an async connection). Had to compile this from source.
  • Using a singleton connection (similar to Node).
  • Netty's framework is very arcane – this example took me by far the longest to implement :)

Go

  • Used a simple forking model w/ 2x CPU workers. No local proxy. (A sketch follows this list.)
  • Used the out-of-the-box JSON/zlib implementations – these appear to be unoptimized.
  • Used the go-redis library – no issues.
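
If those "2x CPU workers" are scheduler threads rather than forked processes (an assumption on my part), the relevant knob in Go is GOMAXPROCS, which defaulted to 1 in the 1.0.x era and had to be raised by hand at startup:

// Set at startup; runtime imported, workload1 as in the earlier sketch.
runtime.GOMAXPROCS(2 * runtime.NumCPU())
http.HandleFunc("/workload1", workload1)
log.Fatal(http.ListenAndServe(":8080", nil))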

Mono

  • Had to build Mono (3.0.1) from source.
  • Mod_mono server 2.5 hasn't been updated to run 4.5 code – tried running a local proxy to the XSP server (which can run .NET 4.5 code).
  • The async keyword isn't implemented yet – ran into this when trying to set up an async HTTP handler.
  • The benchmark represents a fully sync implementation. Ran Mod_mono on Apache with both the prefork and worker MPMs. The worker MPM (the recommended configuration) actually yielded crappier performance for this workload.
  • Used the ServiceStack implementation – no issues here.

Ruby

  • Initially started w/ a Rails-based app – then realized WEBrick isn't meant to be a production server :)
  • Tried to get a Mongrel cluster set up and failed – cut over to Thin.
  • Thin is pretty good – I wound up cutting out all the Rails stuff and went with a simple shim app (Rails, I suspect, would have added some amount of overhead).
  • Set up 16 instances of Thin and configured Nginx to round-robin proxy to each of them (a config sketch follows this list).
  • Used the redis-rb client with no issues.
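
A minimal sketch of that Nginx config, assuming the Thin instances listen on ports 3000 through 3015 (the ports are illustrative):

upstream thin_cluster {
    # Round-robin is Nginx's default balancing strategy.
    server 127.0.0.1:3000;
    server 127.0.0.1:3001;
    # ...and so on through 127.0.0.1:3015
}

server {
    listen 80;
    location / {
        proxy_pass http://thin_cluster;
    }
}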

JRuby

  • No major issues setting up JRuby on top of an existing Ruby installation, other than understanding that the gems are managed separately :)
  • Used the Ruby configuration above, minus Thin.
  • Used the Trinidad server instead of Thin here – I expect a single JVM instance schedules threads better than the Frankenstein Nginx proxy + Thin setup.

Code

We've made the code available for anyone who'd like to compare results – I'm super interested to hear if anyone has a killer tweak that will substantially change the performance characteristics of one of these frameworks for these workloads :)

https://github.com/juiceboxgames/web-bench

Comments

  1. funny_falcon says:

    You should try Ruby 1.9.3 with Rainbows! using several worker processes and ThreadPool-style workers.

  2. For PHP it would be interesting to change a few components:
    Apache -> Nginx
    mod_php -> php_fpm
    Predis -> phpredis (extension)

    • Nginx + PHP-FPM will make a huge difference; phpredis, probably not so much – the Predis library is good enough (it all comes down to sockets in the end anyway).

      From my experience, simply switching from Apache to Nginx can improve performance by up to a factor of 10x.

      So yeah, it's not exactly fair to compare Apache with Node running behind Nginx – at least compare the same web server.

  3. Would you consider performing a similar benchmark with D and Vibe.d? http://vibed.org/

    I’m curious to see how well it does on these workloads.

  4. Silas Baronda says:

    Like funny_falcon said, use Ruby 1.9.3; also use Oj for handling JSON decoding and encoding.

  5. I ran your Go code against tip (the latest development version) and I got a bit of a performance increase.
    TIP
    Concurrency Level: 250
    Time taken for tests: 78.262 seconds
    Complete requests: 10000
    Failed requests: 0
    Write errors: 0
    Total transferred: 1310000 bytes
    HTML transferred: 150000 bytes
    Requests per second: 127.78 [#/sec] (mean)
    Time per request: 1956.553 [ms] (mean)
    Time per request: 7.826 [ms] (mean, across all concurrent requests)
    Transfer rate: 16.35 [Kbytes/sec] received

    go 1.0.3
    Concurrency Level: 250
    Time taken for tests: 104.823 seconds
    Complete requests: 10000
    Failed requests: 0
    Write errors: 0
    Total transferred: 1310000 bytes
    HTML transferred: 150000 bytes
    Requests per second: 95.40 [#/sec] (mean)
    Time per request: 2620.563 [ms] (mean)
    Time per request: 10.482 [ms] (mean, across all concurrent requests)
    Transfer rate: 12.20 [Kbytes/sec] received

  6. I would love to see the benchmark with APC enabled for PHP

  7. Interesting comparison. Please redo the graphs though, I find it hard to distinguish the different back-ends from each other.

  8. This is a really useful post, thanks for sharing. I’m curious, why use an NGINX proxy for node? It seems like unnecessary overhead.

    • It seems to be pretty standard to set up a proxy in front of Node, for a few reasons:

      - Nginx (or Apache, for that matter) is much better at serving static content than Node – it'd be like using PHP to file_get_contents static files and serve them back. Now granted, your app servers shouldn't really be serving ANY static content (it should be hosted on a CDN) – but if you're going to do it, best to have your web container (and not your app server) stream that content back directly.
      - I think Nginx's scheduling model degrades much more gracefully than a naked Node application serving traffic. Nginx workers will queue requests and wait for processing by the local Node instance more gracefully than raw HTTP clients will.
      - I think transmission using an Nginx worker is cheaper than tying up a Node worker. E.g. the Node worker transfers data to a local Nginx worker, and the Nginx worker manages transmission to the client (which could take its sweet time receiving the bytes, especially if it's a mobile client).

      This is maybe a bit hand-wavy, but in my tests, projecting load directly onto the Node instance with ab produced a much wider variance (and correctness issues) when dealing with a naked Node app serving traffic. That's not to say you couldn't harden a naked Node app – but it's probably a fair amount of upfront investment.
