Prelude to building a platform

I first joined Atlassian back in around 2015. It was a time when Atlassian was getting serious on cloud and was running into some challenges with the architecture of its current platform.

The architecture of the platform of the day was to take Jira and Confuence server products, bundle them up with a cut down version of Crowd for the Identity & User management capabilities, package them into a Virtual Machine image and then for every customer who signed up they would be provisioned a new VM instance running in one of Atlassians data centres.

As a first incarnation of a hosted offering it was a great idea, however Atlassian had several problems:

  • massive growth – every problem is an opportunity and every opportunity is a problem. It was great for the business but Atlassian cloud was on an exponential growth trajectory and that meant the system had to scale
  • customers don’t use the product 24/7, they come and go, but when they want it, they want it now; hence that means a VM has to be running 24/7 and that means it’s consuming RAM. Hence, the dominant factor in your infrastructure costs scale with the number of customers you have (regardless of how big they are or how much they’re using your product).
  • this meant Atlassian cloud was expensive to scale – you incurred a cost for every customer (using the system or not) which put upward pressure on cost which is a big problem if a big part of your strategy is keep the cost of your tool low and you are looking to offer a freemium version of your product.

There were some incremental attempts to address problems and reduce memory usage like trying to move some services out of customer VMs into a common set of services that serviced many customer VMs but ultimately the writing was on the wall that the only way Atlassian was going to win in the post-server world was to go all in on a solution that scaled with customer usage; enter a program called Vertigo.

Vertigo is a massive company wide story that ran for several years with varying degrees of investment and steps along the way. Atlassian went all in on AWS, built out a platform of platforms to support the move and has been in an ongoing effort to transform a monolithic codebase designed to run on a single server, to tenantless services that scale on usage (rather than number of customers). It has since grown to support hundreds of thousands of customers and millions of users.

Through this process all teams were making decisions on the varying degree to which they’d rewrite, decompose or otherwise change their systems to work in the new world. Jira and Confluence focused mostly on a lift and shift effort without massive investment in decomposing themselves. Identity went a very different path and opted to mostly rebuild the platform for the cloud and made major shifts in our conceptual model (e.g. pre-cloud user accounts were local to a product, in the migration to cloud all users moved to a single global account keyed by email address) . We didn’t rewrite everything and six years on we’re in the final throes of separating ourselves from our pre-cloud legacy, but at the start we had zero micro services and by the end we had almost forty.

We went from a platform products synced data from to one thats on the hot path for every request and today we serve up over 100 billion requests a week at latencies measured in low tens of ms and reliability > 99.999%.

2015 to 2016 was a wild time. It took us around 2 years to build the platform and launch it to our first customer ( a recent signup with fewer than ten users ), but we were first to get there – both as a point of pride, but also out of necessity (cloud products don’t work without users or accounts).

Coming up next… lessons learned along the way; stay tuned!

Leave a comment