Prelude to building a platform

Featured

I first joined Atlassian back in around 2015. It was a time when Atlassian was getting serious on cloud and was running into some challenges with the architecture of its current platform.

The architecture of the platform of the day was to take Jira and Confuence server products, bundle them up with a cut down version of Crowd for the Identity & User management capabilities, package them into a Virtual Machine image and then for every customer who signed up they would be provisioned a new VM instance running in one of Atlassians data centres.

As a first incarnation of a hosted offering it was a great idea, however Atlassian had several problems:

  • massive growth – every problem is an opportunity and every opportunity is a problem. It was great for the business but Atlassian cloud was on an exponential growth trajectory and that meant the system had to scale
  • customers don’t use the product 24/7, they come and go, but when they want it, they want it now; hence that means a VM has to be running 24/7 and that means it’s consuming RAM. Hence, the dominant factor in your infrastructure costs scale with the number of customers you have (regardless of how big they are or how much they’re using your product).
  • this meant Atlassian cloud was expensive to scale – you incurred a cost for every customer (using the system or not) which put upward pressure on cost which is a big problem if a big part of your strategy is keep the cost of your tool low and you are looking to offer a freemium version of your product.

There were some incremental attempts to address problems and reduce memory usage like trying to move some services out of customer VMs into a common set of services that serviced many customer VMs but ultimately the writing was on the wall that the only way Atlassian was going to win in the post-server world was to go all in on a solution that scaled with customer usage; enter a program called Vertigo.

Vertigo is a massive company wide story that ran for several years with varying degrees of investment and steps along the way. Atlassian went all in on AWS, built out a platform of platforms to support the move and has been in an ongoing effort to transform a monolithic codebase designed to run on a single server, to tenantless services that scale on usage (rather than number of customers). It has since grown to support hundreds of thousands of customers and millions of users.

Through this process all teams were making decisions on the varying degree to which they’d rewrite, decompose or otherwise change their systems to work in the new world. Jira and Confluence focused mostly on a lift and shift effort without massive investment in decomposing themselves. Identity went a very different path and opted to mostly rebuild the platform for the cloud and made major shifts in our conceptual model (e.g. pre-cloud user accounts were local to a product, in the migration to cloud all users moved to a single global account keyed by email address) . We didn’t rewrite everything and six years on we’re in the final throes of separating ourselves from our pre-cloud legacy, but at the start we had zero micro services and by the end we had almost forty.

We went from a platform products synced data from to one thats on the hot path for every request and today we serve up over 100 billion requests a week at latencies measured in low tens of ms and reliability > 99.999%.

2015 to 2016 was a wild time. It took us around 2 years to build the platform and launch it to our first customer ( a recent signup with fewer than ten users ), but we were first to get there – both as a point of pride, but also out of necessity (cloud products don’t work without users or accounts).

Coming up next… lessons learned along the way; stay tuned!

Hello World

Featured

Welcome to my blog.

Who Am I? Tempting to say Will.I.Am as I do like his music, but in this distributed remote first world most often I’m referred to by my slack handle at work as @fireatwill.

I’ve been many things over the years. I’m dad to my daughter Lily and son Jackson, husband to Caroline my wife, head of engineering, mountain biker, lover of unique beers (my wife is belgian and I confess a love for Gueuze which like my tastes in 90s pop culture are a bit of an acquired taste) and tinkerer.

I’ve been building system in code and with people for almost 20 years. I love knowing why things are what they are and going deep on connecting the dots of the how with the why, so its no surprise I ended up where I am. I’ve blogged on and off on internal corporate blogs over the years but finally thought it time to make an effort to share some of the lessons learned along the way more broadly and connect with other like minded folk on their experiences.

Probably the most interesting story for me has been the past 6 years with Atlassian. Prior to that most of my focus had been on building software and teams, but I joined Atlassian at the beginning of its current cloud journey and one of the biggest learning opportunities here for me has been the shift from the traditional focus on building software for functional requirements to what it means to build and run systems at scale. The ship, learn and iterate that comes from running these system, the debates on how you measure reliability and the challenges that come out of lesser explored non-functional requirements that come to the forefront at 3am when things go bump in the night have been some of the funner problems to puzzle over.

I hope you may learn something, share back but most of all enjoy.