Change management in a DevOps world – Or – How to rodeo with one of the two horsemen of the apocalypse

2 horsemen? What happened to 4? Well, 2 of them, Change and Scale, tend to visit quite frequently as the underlying cause of incidents. The other 2 I’m yet to settle on, but I reckon one should be Security. Security seldom visits, but when she does she doesn’t just hit hard and run like the other two; she brings the rain. Change, on the other hand, visits frequently and should be high on your list of things to noodle over if you want to have teams running services in the cloud. With that in mind, let’s talk about my time with Atlassian and the varying flavours of dealing with it.

Once upon a time the job of a Software Engineer was to write software. That simple definition started to blur around the time Agile found its footing and started talking about cross-functional teams. The real death knell, however, came with the DevOps movement and true vertically integrated delivery from commit through to end customer with no other parties involved. At some point we’ll have to revisit what we call Software Engineers to better reflect that they are now effectively service providers, but that’s not what we’re here to discuss today. When I joined Atlassian it was about the time it started to shift from having teams that dealt with operations to teams that were embedded to help support teams in running services, with aspirations toward You-Build-It-You-Run-It.

With the big lift and shift to a constellation of tenantless microservices in cloud, Identity was forced to go the whole hog early on and fully embrace a You-Build-It-You-Run-It model. Most other teams came later or are still on their journey to decompose and make use of a Site Reliability Engineering model. Any approach to rolling out services with minimal change-related incidents requires a combination of many things: technology/tooling, process and culture.

There are several ways you can approach this with your team:

Change Advisory Boards – If you are doing this (or subject to it), stop; grab a cup of coffee and go read Accelerate. It is one of the best treatises on the subject of software team performance I’ve read, and it found that “approval by an external body (such as a manager or CAB) simply doesn’t work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all.”
Chopper – A process in Atlassian for teams mostly working on monolithic systems, where many changes committed to a single system need to be bundled up and rolled out. It is a centralised process: changes are brought to the Chopper meeting, reviewed by a panel of experts on the system, and a “pilot” is nominated to shepherd the batch of changes through. It sounds like a CAB but it is different. Primarily, the panel are fellow developers working on the systems with a large degree of expertise (i.e. it is an internal review by experts, not an external one). It is a great system when you need to coordinate a large number of changes through a single deployment pipeline or when you don’t have enough expertise in your teams to go for something more autonomous.
The Bar – This is what we went with in Identity. With microservices owned by the team, there is no panel of experts outside the team for the systems they work on. We had the knowledge within the department, and for the most part in teams. What we didn’t have was:
- the sense of empowerment and accountability – delivery pressure, lack of clarity on expectations to push back on “just ship it” were all cited as reasons for taking risks that contributed to incidents
- clarity on when/if there had been a review of the plan to manage the change risk; moreover lack of clarity on who is signing off on that work and when – no one should mark their own homework.
- awareness of tools / patterns / solutions to common gotchas – Also varying degrees of inconsistent application of tools / standards / patterns (e.g. “fast five” – an approach for making changes to schema in isolation from code)

So which way do you go? Well, obviously not the first. But if you need to coordinate a high-volume stream of changes into a monolith from many teams, or your teams don’t yet have the expertise to write and operationalise services themselves, go Chopper. However, if your teams know what to do but you have issues with empowerment and accountability… go The Bar.

What is The Bar?

The Bar is made up of three pieces:

The Manifesto – a statement around expectations to be applied on every change. It is above all else a statement of empowerment and accountability for the engineer making the change and the one reviewing it.
The Process – the agreed upon means by which The Manifesto is enforced – at the core is peer review of the thinking and work that goes into a change. This includes consideration of “The Standards” – practices that are strongly recommended but may not be applied due to lack of awareness or encouragement.
The Standards – A set of patterns and tools that represent best practice and should apply to all teams and services. For example: When using progressive rollout, all services should be monitoring for elevated rates of 5xx and 4xx errors on the new stack. It is deliberately not exhaustive, but lists important tools/patterns teams should follow that may not yet be second nature to everyone in the team.
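
To make the example standard above concrete, here is a minimal sketch of a check that compares 4xx and 5xx rates on the new stack against the old one during a progressive rollout. The names (`StackStats`, `should_roll_back`) and the thresholds are assumptions for illustration; they are not Atlassian tooling or values from our Standards.

```python
from dataclasses import dataclass


@dataclass
class StackStats:
    """Request counts observed on one stack during a rollout window."""
    total: int
    errors_4xx: int
    errors_5xx: int

    def rate(self, errors: int) -> float:
        # Avoid dividing by zero when a stack has seen no traffic yet.
        return errors / self.total if self.total else 0.0


def should_roll_back(old: StackStats, new: StackStats,
                     max_increase: float = 0.01,
                     min_requests: int = 100) -> bool:
    """Return True when the new stack's 4xx or 5xx rate is elevated.

    max_increase and min_requests are illustrative thresholds only.
    """
    if new.total < min_requests:
        return False  # not enough traffic on the new stack to judge
    elevated_5xx = new.rate(new.errors_5xx) - old.rate(old.errors_5xx) > max_increase
    elevated_4xx = new.rate(new.errors_4xx) - old.rate(old.errors_4xx) > max_increase
    return elevated_5xx or elevated_4xx


old_stack = StackStats(total=10_000, errors_4xx=200, errors_5xx=10)
new_stack = StackStats(total=500, errors_4xx=10, errors_5xx=30)
print(should_roll_back(old_stack, new_stack))
```

A real implementation would likely use proper anomaly detection rather than a fixed difference, but the shape is the same: compare the new stack to a baseline and let a machine make the call.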

The Manifesto

I will not merge / promote / release until it meets The Bar. At a minimum that means –

We are confident:

we can detect breaking changes through automated means before they go out to production
we have automated means to measure that things are working as expected and will be alerted when they’re not working as expected in staging and production
if we release a breaking change, the impact would be mitigated within SLO and without causing a major incident
this change is resilient to failures in its dependencies
we understand the load expectations on our service & its dependencies and the change in load can be safely managed

You-Build-It-You-Track-It – for any change you are accountable for ensuring it passes The Bar; you have full credit for the work as well as full responsibility for safely shepherding it to production.

Approving-Is-Co-Authoring – For any change you approve, you are responsible for understanding the change and agreeing that it passes The Bar; you should consider the change co-authored by you.

The Manifesto is key. It is basically an agreed definition of done for operational concerns that gives engineers the agency to question and push back on shipping if they lack confidence and wouldn’t put their name to the decision to ship it.

The Process:

We debated whether there should be one process for simple changes and another for complex / bigger changes. Rather than push the conversation to debating whether a change is simple or complex (when in reality any change is on a spectrum that requires application of judgement), we opted for a single process.

At its core is the application of judgement from the engineer owning the issue to ensure The Bar is met, peer review by one member of the team who shares equal accountability and a supporting list of best practices we expect teams to explicitly consider.

The process…
- Every issue starts …
  - “In-Analysis”
    - planning what needs to be delivered to meet The Manifesto outlined above with a peer
    - which of the practices below apply to this particular issue
    - documenting this on the issue
- For every change that will go out to production (eg. a PR, or feature flag change)
  - A reviewer will check the documentation created above and make sure all relevant items have been considered, and that any additional items required to meet the Manifesto are implemented in the change, before it can go to production. The discussion is documented on the PR (or in an appropriate place for non-PR-based changes).

Expected outcomes

All changes have a kick-off with a peer reviewer to assess risk and form mitigations
No changes will make it through to prod without a reviewer assessing that the mitigations are in place

The beauty of this process is its flexibility: teams can tailor other checkpoints however they like. Some teams augment their design reviews with specific questions relevant to known risk areas and problems. Some have automated rules, triggered on raising a PR, that detect specific changes and require responses to specific questions. They’re all different, but they all define how their process works and the checkpoints that ensure the Manifesto is upheld. Between the flexibility in the process and the accountability from the Manifesto, The Bar ensures teams create processes that deliver on the outcome, rather than the output of following something more prescriptive that stops them applying their judgement (The Bar is there to support good judgement, not get in the way of it).

The Standards

I won’t paste them verbatim here because they really depend on where your team are at and some of them involve adoption of Atlassian tech but to give you a flavour:

Progressive rollout with anomaly detection and automated rollback – have the least amount of impact we can have, detect it as early as possible and automate the response – 99.99% only allows for about 4 minutes of disruption in a month; you can’t achieve that if your detect-response loop involves humans and a large percentage of users.
Fast 5 for data changes – basically roll out new code to deal with the schema change, roll out the schema change, roll out the new behaviour change, then remove the old data, code and behaviour. The intent here is to have a safe, tested way of running multiple versions of data, code and behaviour in parallel (because you may need to revert to a previous state at any point until the entire new state is validated).
Safe rollbacks up to 48 hours – even with the above and with progressive rollout we know some problems aren’t detected right away, and the fastest way to resolve an incident is often to auto-rollback to a previous version if a new one was recently deployed. Endeavour to make this an unquestionably safe operation.
Consistent two-way feature flags – feature flags are progressive, often eventually consistent and may flip back at any time. Do not assume flags only ever progress in one direction; even if you plan to use a flag that way, your underlying feature flag service may fail and revert to the default state.
Announce changes, have appropriate docs and plans, ensure support staff know what is happening so they can alert on the unexpected (and not look like numpties if something new goes out).
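
To give a feel for what “two-way” means in code, here is a small sketch. `FlagClient` is a hypothetical in-memory stand-in for a feature flag service (not a real library), and the flag and field names are made up. The idea it illustrates: while a flag can still flip, keep both code paths fed with valid data so that reverting to the default is always safe.

```python
class FlagClient:
    """Hypothetical stand-in for a feature flag service."""

    def __init__(self, values, available=True):
        self.values = dict(values)
        self.available = available

    def is_enabled(self, name, default=False):
        # If the flag service is down, every flag falls back to its default.
        if not self.available:
            return default
        return self.values.get(name, default)


FLAG = "use-new-display-name"  # hypothetical flag name


def write_display_name(user, value):
    # Dual-write while the flag may still flip, so neither path reads stale data.
    user["display_name"] = value
    user["display_name_v2"] = value


def read_display_name(user, flags):
    # Either branch must work at any time, whichever way the flag last moved.
    if flags.is_enabled(FLAG):
        return user["display_name_v2"]
    return user["display_name"]


user = {}
write_display_name(user, "Ada")
healthy = FlagClient({FLAG: True})
degraded = FlagClient({FLAG: True}, available=False)
print(read_display_name(user, healthy), read_display_name(user, degraded))
```

Only once the flag is fully rolled out and you are past the window where you might revert would you remove the old field and the old branch.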

So how are we doing…

We had our quarterly TechOps review where we look at operations and incident trends. Major incidents due to change have continued to trend down throughout the year (and TBH we’re in very low numbers). Teams aren’t stifled by rubber-stamping processes and box-checking that don’t apply to their context, and they feel empowered because The Bar doesn’t prescribe what to do; it just provides the agency for teams to do the right thing.

The foundation of DevOps – If you want to be good at something… you’ve got to spend time on it

I came from a software delivery world, and I mean that very literally. Wind back the clock not even 10 years and SaaS was an outlier, not the mainstream. Software developers were responsible for shipping software (even the term shipping has connotations of that being the act that delivers value to customers rather than just the first step). In the world of software delivery efficiency, time is valued when it is spent working on features. Which raises the question: what happens when a team now has the added task of running services, and what does that even mean or entail?

My journey in DevOps started around 2016. At the time we had an SRE team for Atlassian account and we had operations folks that looked after our VM-Per-Customer offering. We had just started launching services and had settled on You-Build-It-You-Run-It as our operating model; but how would software engineers make the transition? Luckily, we had a couple of tail winds that helped us. Firstly, Identity started out as a platform that was mostly a system of record; users, groups and access were configured in Identity but then synced out to the products, which dealt with the gnarly runtime requirements. This meant that we didn’t have to handle a large volume of requests, low latency or even high reliability. Secondly, we had evolutionary pressure from a new product being developed – Stride (HipChat’s to-be-but-not-quite successor). Stride didn’t want to build all the non-differentiating stuff (nor did it have the time). At the same time they had a strong engineering lead who early on established the measure of operational success – “Can I F$#ken chat?” – and a weekly spark line showing where they weren’t able to and who was responsible.

I was trying to figure out – how am I going to justify to myself or anyone else the investment that is going to be required for teams to know what is going on in prod, figure out how to fix it and prioritise it amongst all the stuff we have to build and ship? I remember a discussion with the SRE lead in Identity at the time, and in a very simple sentence he gave me something that has stuck with me ever since…

If you want to be good at something you need to spend time on it.

I’m going to labour the point here a bit so indulge me. This statement is so powerful for two reasons:

Its simplicity – never underestimate the power of a concise message that is short enough and powerful enough to be repeated by others.
It doesn’t tell you what needs to be done. It is a statement with an obvious logical consequence you can’t argue with. The reality is self-evident: we need to be good at operations, so teams (not just individuals) need to spend the time to grow that muscle and learn to become good at it.

Great we’ve got the time, so now what do we spend it on?

People often don’t realise it but engineering leaders still build systems; they just extend out beyond software. We work in a repository of people, tools and processes. One of the engineering disciplines that has influenced how I think about steering teams toward success is control theory; thinking of the desired output of a system and the role of feedback in driving it toward that state. In this case by asking teams to meet, to look at their key operational data and determine corrective action in a weekly meeting that is known as TechOps. The outcomes are presented to leadership and teams held accountable in TechOps rollup (held the day after). The ritual came from Stride but like all good engineering processes it is basically an implementation of the inspect and adapt feedback loop with leadership choosing the things that are of value and the weekly presentation providing bidirectional feedback on if we’re moving in the right direction.

Alright, but what should teams be looking at? What is “key operational data”? The answer evolves: which things teams focus on, and how many, is a key way leaders drive change and the evolution of your operational maturity. Ours was quite simple:

Incidents in the past week – what incidents did we have, why, what are we doing about it
Alerts – how many have we had, are they signal or noise? If they’re signal, what action do we take to automate addressing the problem?
Service SLOs – did we meet or miss / why and what are we doing?
Overall what are the trends? We care less about an errant week than an ongoing or emerging problem.

Why these? At the beginning we didn’t have great telemetry. We had service level SLOs (like how many 9s of 2XX vs 5XX HTTP responses you had) but that didn’t really correlate with customer impact (Incidents). High quality alerting is critical – if signal-to-noise is poor alerts are ignored, if you have none you’re entirely dependent on customers to tell you there is a problem or engineers noticing something is going wrong (the quality of your automated detection is absolutely key). Which brings me to another philosophical maxim…

If you have $10 to spend on reliability, spend $8 on your observability

This is heavily predicated on “If you can’t measure it, you can’t improve it”. You can pour a lot of concrete to solve a problem you don’t have or leave a lot of problems unanswered if you don’t have good eyes on what customers are experiencing in Production. In the absence of good observability you are relegating your customers to being detectors. In case it is not apparent, customers do not like this and are the one form of detector that can self elect to remove themselves from your observability suite. They are however the most accurate detector and you will fall back on them from time to time, so do yourself and them a favour – make this a rare occurrence.

So what kind of measures should you have?

service quality – did the system deliver the experience expected to customers aka. “Can I f@#king chat”.
early warning signals – measures inside the system that are a prelude to a problem occurring – heap size, CPU, latency, queue length; i.e. technical measures that correlate to the system behaving correctly but can’t actually tell you if customers are getting what they want. They can however tell you when there is a problem brewing that may soon result in a service quality issue.

Why have both? Why not just measure service quality and be done with it? In an ideal world, if you can resolve your performance to customers down to whether your endpoint returned 2XX, then you are living the dream. This is a great first step on your DevOps maturity journey, but what was the payload in that response? Did it contain the expected information (i.e. is the system up but wrong)? Your API may be working, but is the single page app that sits atop it behaving? Resolving answers to those questions can be complex. For Identity one of our service quality metrics is “Can I F@#ken login?” Login is a multi-step flow that depends on whether you are using username/password or SAML, and whether 2FA is enabled. Did the user need to reset a password? Did they go and grab a coffee part way through the flow? We are more tolerant if 1/1000 logins failed but could immediately retry and succeed (i.e. the failure is not persistent). Hence we can’t immediately tell if login is falling below our internal SLO of 99.99%, because we need to track users from start to finish, and we allow up to 5 minutes for a user to do that before saying “if a user didn’t login and we saw an error of some kind we will treat that as a failure”.
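
As a rough sketch of that rule, the snippet below classifies one login attempt from its events. The event shape (a time-ordered list of `(timestamp, kind)` tuples starting with a `"start"` event) and the function names are assumptions for illustration, not our actual pipeline. It also assumes that attempts with neither a success nor an error in the window (the coffee break case) are excluded from the ratio, which is one possible reading of the rule.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # time allowed to complete a login


def classify_attempt(events, window=WINDOW):
    """Classify one attempt: 'success', 'failure' or 'abandoned'.

    events is a time-ordered list of (timestamp, kind) tuples whose first
    entry marks the start of the attempt; kind is 'start', 'error' or 'success'.
    """
    started_at = events[0][0]
    kinds = [kind for ts, kind in events if ts - started_at <= window]
    if "success" in kinds:
        return "success"   # a retry that succeeds in time is tolerated
    if "error" in kinds:
        return "failure"   # saw an error and never got in
    return "abandoned"     # user wandered off; no error observed


def login_success_ratio(attempts):
    outcomes = [classify_attempt(a) for a in attempts]
    counted = [o for o in outcomes if o != "abandoned"]
    if not counted:
        return 1.0
    return counted.count("success") / len(counted)


t0 = datetime(2024, 1, 1, 9, 0, 0)
retried = [(t0, "start"), (t0 + timedelta(seconds=20), "error"),
           (t0 + timedelta(minutes=1), "success")]
too_late = [(t0, "start"), (t0 + timedelta(minutes=1), "error"),
            (t0 + timedelta(minutes=6), "success")]
print(classify_attempt(retried), classify_attempt(too_late))
```

Note the built-in lag: you cannot score an attempt until its 5 minute window has closed, which is exactly why this kind of measure can’t be your only means of detection.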

The other reason teams don’t think about is…

Your uptime is part of your marketing collateral

Atlassian is a federation of products; new acquisitions are made, new products are built. Atlassian serves everyone from 10-member startups to the Fortune 500. One of the big impediments to moving to cloud is concern about reliability. Hence, one of the biggest worries for those customers (and the products looking to come aboard your platform that serve those customers) is your uptime. Having a 12+ month trend of meaningful service quality metrics goes a long way to build trust and transparency.

In case you don’t already have your number-of-nines times tables memorised: 99.99% across a month only allows about 4 minutes of downtime before breaching. Service quality measures by definition only tell you once you have started to negatively impact the customer experience, and can have too much lag to be your primary means of detection / automated rollback. Early warning signals are less precise but give you faster feedback; hence you need both in balance.
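
For anyone who wants to check the times tables, the arithmetic is simple enough to sketch. The function name is mine, and the 30-day month is an assumption.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


for slo in (99.9, 99.99, 99.999):
    # Rounded for display; 99.99% over 30 days is roughly 4.32 minutes.
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.2f} minutes per 30 days")
```

A 30-day month has 43,200 minutes, and 0.01% of that is 4.32 minutes, which is where the “4 minutes” rule of thumb comes from.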

Signal to noise ratio is king

My last piece of advice for this post: if I said invest $8 on observability, spend half of that on ensuring you dial signal to noise up to 11. It shouldn’t really cost you that, but in terms of priorities that is where I’d set them. Folks say noisy alerts wake people up at night, and this is true. But any alert that fires on a regular basis because of some expected but un-actionable phenomenon creates a noise floor that masks real problems, slowing MTTR (in some cases by many hours / overnight) because the team thought it was the “everything is as expected” alert. Noisy alerts can also hide paper cuts and bugs that might not be major incidents but may present low-hanging opportunities for the team to improve.

Do not overlook signal to noise in logs. We specify a standard on how logging levels should be used:

Error – Something that should never happen and if it does warrants investigating – e.g. a service responded with a 409 when we only ever expect it to respond with a 200, 401 or 5XX. If this happens you should fire a high priority alert because something is up. You should expect to see Zero errors.
Warning – Things that go bump in the night but are anticipated – e.g. a failed call to a dependency. You’d log this as a warning, and you may choose to alert on these if you see a high volume in a short window. You may see many of these in a week for a high volume service, and you should investigate if you are seeing a large number of them or an upward trend.
Info – BAU logging and operations. E.g. used for access logging for people looking to investigate things
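
Here is a small sketch of that standard using Python’s standard `logging` module, based on the 409 example above. The logger name and the `log_dependency_response` helper are hypothetical, and it assumes 200 and 401 are the only expected non-5XX responses from this particular dependency.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("identity.example")  # hypothetical logger name

# Responses we expect from this dependency in normal operation (assumed).
EXPECTED_STATUSES = {200, 401}


def log_dependency_response(status: int) -> int:
    """Log a dependency response at the level the standard calls for."""
    if status in EXPECTED_STATUSES:
        level = logging.INFO      # BAU
    elif 500 <= status <= 599:
        level = logging.WARNING   # anticipated failure; watch volume and trend
    else:
        level = logging.ERROR     # should never happen; investigate and alert
    logger.log(level, "dependency responded with %d", status)
    return level


for status in (200, 503, 409):
    log_dependency_response(status)
```

The point is that the level is decided by what you expect from the dependency, not by how alarming the status code looks.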

The best teams I’ve seen actively review their alerts and logging during standup to talk about what’s going on in the system and filter through work to drive those numbers down. Having consistency and clarity on what logging levels equate to what level of response, and using it to drive investigation and fixes of underlying problems, is a great way to ensure a fast response when there is a major problem, as well as ensuring that over time you address the paper cuts that can add up. It also removes tribal knowledge on what alerts/logs “are ok/expected” vs those that “might be bad” – the level and trends tell you.

So where are we now?

On the day TechOps was announced, our SRE lead presented at the town hall and asked everyone in the room to put their hand up if they knew the SLO of the services they ran. In a room of ~50 people I think I saw maybe 4 hands go up.

Today every team runs TechOps on a weekly basis and we do a rollup. Identity is seen as a leader inside Atlassian when it comes to operations and reliability. We now focus on other metrics beyond service hygiene and health; we talk about:

Security Vulnerabilities and approaching SLA breaches to ensure we patch high CVSS issues promptly
Support escalations and SLAs to ensure teams are acting on support asks with urgency
Cost and wastage – we look at our AWS spend trends and obvious opportunities to reduce waste
Service Quality metrics where we’re in breach or we’ve burnt more than 20% of our error budget – we now have enough confidence in teams to look at their own early warning signs and signal to noise in observability that what we focus on is accountability and action on quality measures.
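
As a sketch of the error budget check in the last item, assuming a request-based SLO (the budget is the number of failed requests the SLO permits over the period). The function names and the sample numbers are illustrative only.

```python
def error_budget_burn(total_requests: int, failed_requests: int,
                      slo: float = 0.9999) -> float:
    """Fraction of the error budget consumed; 1.0 or more means in breach."""
    budget = total_requests * (1 - slo)  # failures the SLO allows
    if budget == 0:
        return 0.0 if failed_requests == 0 else float("inf")
    return failed_requests / budget


def needs_discussion(total_requests: int, failed_requests: int,
                     slo: float = 0.9999, threshold: float = 0.20) -> bool:
    """True when more than `threshold` of the budget has been burnt."""
    return error_budget_burn(total_requests, failed_requests, slo) > threshold


# 1,000,000 requests at 99.99% allows roughly 100 failures.
print(needs_discussion(1_000_000, 30))  # about 30% of the budget burnt
print(needs_discussion(1_000_000, 10))  # about 10% of the budget burnt
```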

We have a quarterly operations review where we look at overall trends and proactive reliability initiatives to make systems ever more resilient and compartmentalised in their blast radius when things go wrong. We have a quarterly FinOps review to project costs for the coming quarters and consider opportunities to change the system to make it more cost efficient. Are we done yet? No… DevOps is a journey and is as much about building knowledge and a culture in teams as it is about the resulting services that get run. Even with a platform that’s globally distributed and serving >10^10 requests a week, we are all still learning.

Thanks for reading, as usual, keen to hear your thoughts / experiences and feedback.

Prelude to building a platform

I first joined Atlassian back in around 2015. It was a time when Atlassian was getting serious on cloud and was running into some challenges with the architecture of its current platform.

The architecture of the platform of the day was to take the Jira and Confluence server products, bundle them up with a cut down version of Crowd for the Identity & User management capabilities, and package them into a Virtual Machine image. Every customer who signed up would then be provisioned a new VM instance running in one of Atlassian’s data centres.

As a first incarnation of a hosted offering it was a great idea, however Atlassian had several problems:

massive growth – every problem is an opportunity and every opportunity is a problem. It was great for the business but Atlassian cloud was on an exponential growth trajectory and that meant the system had to scale
customers don’t use the product 24/7; they come and go, but when they want it, they want it now. That means a VM has to be running 24/7, and that means it’s consuming RAM. Hence, the dominant factor in your infrastructure costs scales with the number of customers you have (regardless of how big they are or how much they’re using your product).
this meant Atlassian cloud was expensive to scale – you incurred a cost for every customer (using the system or not), which put upward pressure on cost. That is a big problem if a big part of your strategy is to keep the cost of your tool low and you are looking to offer a freemium version of your product.

There were some incremental attempts to address problems and reduce memory usage like trying to move some services out of customer VMs into a common set of services that serviced many customer VMs but ultimately the writing was on the wall that the only way Atlassian was going to win in the post-server world was to go all in on a solution that scaled with customer usage; enter a program called Vertigo.

Vertigo is a massive company wide story that ran for several years with varying degrees of investment and steps along the way. Atlassian went all in on AWS, built out a platform of platforms to support the move and has been in an ongoing effort to transform a monolithic codebase designed to run on a single server, to tenantless services that scale on usage (rather than number of customers). It has since grown to support hundreds of thousands of customers and millions of users.

Through this process all teams were making decisions on the varying degree to which they’d rewrite, decompose or otherwise change their systems to work in the new world. Jira and Confluence focused mostly on a lift and shift effort without massive investment in decomposing themselves. Identity went a very different path and opted to mostly rebuild the platform for the cloud, making major shifts in our conceptual model (e.g. pre-cloud user accounts were local to a product; in the migration to cloud all users moved to a single global account keyed by email address). We didn’t rewrite everything, and six years on we’re in the final throes of separating ourselves from our pre-cloud legacy, but at the start we had zero microservices and by the end we had almost forty.

We went from a platform that products synced data from to one that’s on the hot path for every request, and today we serve over 100 billion requests a week at latencies measured in low tens of ms and reliability > 99.999%.

2015 to 2016 was a wild time. It took us around 2 years to build the platform and launch it to our first customer (a recent signup with fewer than ten users), but we were first to get there – both as a point of pride and out of necessity (cloud products don’t work without users or accounts).

Coming up next… lessons learned along the way; stay tuned!

Hello World

Welcome to my blog.

Who Am I? Tempting to say Will.I.Am as I do like his music, but in this distributed, remote-first world I’m most often referred to by my Slack handle at work, @fireatwill.

I’ve been many things over the years. I’m dad to my daughter Lily and son Jackson, husband to my wife Caroline, head of engineering, mountain biker, lover of unique beers (my wife is Belgian and I confess a love for Gueuze, which like my taste in 90s pop culture is a bit of an acquired taste) and tinkerer.

I’ve been building systems in code and with people for almost 20 years. I love knowing why things are what they are and going deep on connecting the dots of the how with the why, so it’s no surprise I ended up where I am. I’ve blogged on and off on internal corporate blogs over the years but finally thought it time to make an effort to share some of the lessons learned along the way more broadly and connect with other like-minded folk on their experiences.

Probably the most interesting story for me has been the past 6 years with Atlassian. Prior to that most of my focus had been on building software and teams, but I joined Atlassian at the beginning of its current cloud journey, and one of the biggest learning opportunities here for me has been the shift from the traditional focus on building software for functional requirements to what it means to build and run systems at scale. The ship, learn and iterate cycle that comes from running these systems, the debates on how you measure reliability, and the challenges that come out of lesser-explored non-functional requirements that come to the forefront at 3am when things go bump in the night have been some of the funner problems to puzzle over.

I hope you may learn something, share back but most of all enjoy.