Surge is brought to you by OmniTI, the leading Web Scalability and Performance provider.
Head of Operations, Betable
Track: Scaling Architecture
A State Machine Datastore in the Wild
Betable, provider of gambling-as-a-service and my employer, recently made the jump from single-player, single-event games of chance like slot machines into multi-player, multi-event games like blackjack. This new service broke a lot of fundamental assumptions and resulted in a new WebSocket-based API, new game services and a new datastore. This is a story about that datastore.
We used to store everything in Cassandra and Cassandra was good to us, especially operationally. Naturally, we first explored how we would use Cassandra to support multi-event games. We prototyped two designs based on Cassandra. Both left a lot to be desired, performance-wise, on the bursty and relatively high-concurrency workload we tested.
So we went to the whiteboard to design our way out of "death by roundtrip." It was time to move computation to the data. The plan became to implement a distributed, replicated state machine in Go. I'll go over the interesting parts of the design: the data model, disk and wire protocols, replication and disaster recovery, secondary indexes, and instrumentation. We made a lot of mistakes, but the fundamentals were sound, so we were always able to recover, usually after extreme panic.
I'll also go into detail on the human factors. We skipped the honeymoon phase in which a project is still "on time," and we generally handled our deadlines and expectation-setting poorly. We further complicated matters by splitting what would typically be one service and one dumb datastore into two services and developing them in tandem. Despite these hurdles, we shipped everything without catastrophe.
This is the story of what went right and what went wrong as we developed, deployed and operated this new service for our game developers and their players. We'll see if by September this all still seems like a good idea.