24.06.2020 · 12 min
Jakub Filipczak
SRE Manager / Guardian, IDEMIA

Short story of IDEMIA’s SRE

Follow the story of implementing Site Reliability Engineering and find out how to cope with such a challenge.

What I would like to describe is the story of a specific company in which I was asked to take over the lead of part of the existing L1/L2 support teams and transform them into an SRE team, supporting a purely B2B SaaS offering housed in our private DCs, with all of the products falling under PCI-CP or PCI-DSS certifications. How did we cope with such a challenge? Was it worth it?

Too many tickets

Imagine a typical software company, producing a bunch of software products across a number of business domains. Software that does all sorts of different things, some more complicated, some less. Now imagine how the software lifecycle is typically managed in that kind of place. I guess in your mind you’re now seeing a bunch of teams: development squads, ops, and infra teams. You may very well think they do try to cooperate with each other. Sometimes the cooperation works out, sometimes it does not. What draws attention are the teams forming separate silos, with high, high walls between devs and ops. And a lot of finger-pointing.

So, that’s pretty much where we at IDEMIA were some time ago. Dev guys were coding, Ops guys were trying to keep the services the devs had coded up and running. Due to the nature of the business we’re in, we had to keep a formal separation between both departments. It obviously wasn’t helping anyone.

You may ask why we started to look for other ways of managing operations. That question has a fairly easy answer, obvious to anyone involved in operations – it was a number of outages hampering our services, pulling people out of their beds in the middle of the night just to learn that “damn, the X thing is down again”. We had a support organization in place, but it was not an effective one. People were dealing with tickets, forgetting that behind every ticket there was a business service failure.

Our starting point for that journey was one large L1 team and one large L2 team. Products that had little in common, documentation of sometimes questionable quality, four different business domains doing things in various ways, all handled by the same set of people. If you say that one cannot reasonably expect people to be experts in all of those areas, you are saying exactly what I said when I was asked to lead the transformation for one of the business domains. Support teams were obviously overloaded, while also having little experience outside of typical operations, meaning that more often than not they weren’t aware of how the platforms actually worked. “Roll it out, check if it’s up, no exceptions in the logs – we’re done. If something doesn’t work, log the ticket.” I’m sure you have seen the same scenario in other companies too; I certainly have.

Looking for the remedy

We knew the situation was not sustainable in the long term, with clear signs of what was not working correctly, and with no real plan for how to change the situation apart from some small improvements. We could see things were not moving in the right direction, but we didn’t know how to change the course of that boat. And then one of our architects mentioned the Google book titled Site Reliability Engineering. And that was something of a revelation: a book speaking about obvious things we already knew, that somehow we were missing in our daily jobs. It looked like we just didn’t know how to name them. We realized the SRE concept had some potential.


IDEMIA’s management team decided to fire up a proof-of-concept implementation of SRE for a single team, checking as we went whether it made any sense in our ecosystem. The easiest way to do that was to plan and execute the L1/L2 transformation, and while that approach was not ideal from the SRE point of view, we decided it would be our first step. At this point we also prepared a breakdown of how we imagined IDEMIA SREs would differ from the legacy support organization, based on what we already knew was not top-notch in L1/L2.

As you can see, we came up with the idea of creating not one but two separate teams, and while that doesn’t follow what Google advocates, in our situation splitting the SRE team was perfectly reasonable for one simple reason – our customers expected 24/7/365 support services to be available. Dragging an on-call engineer out to answer a customer in the middle of the night every other day would simply not be sustainable. We did make a couple of important assumptions at that point that became the basis for our transformation plan. One major assumption spanning both teams was that we stop pretending everyone in the team knows every product and underlying tech stack. Instead, we grouped people into separate, business-domain-oriented teams and asked them to focus on a single business, giving them the opportunity to get familiar with the service, the middleware, and the customers.

SRE Monitoring Team – spot the problem before the customer does

The major idea behind that part was to form a team of engineers ready to work on a 24/7/365 calendar, with an understanding of what their products actually do. That doesn’t sound very different from what L1 means, but in our case it wasn’t exactly true – and that’s reflected in one point: “Spot the problem before the customer does”. The rationale behind that had a historical background: in the L1 days we had many problems that were reported by customers without being picked up by us. You can imagine what the customer satisfaction surveys looked like when customer ops teams used to call us asking what’s going on with platform X, and we were saying “well, my monitoring is green”. From the moment SRE emerged, every serious problem reported by the customer, and not by us, is treated as our failure to provide adequate service. I can’t help myself, I just have to pull up some numbers – the number of issues reported by our customers before we knew about them was reduced by more than 80% within a year. I don’t know about you, but I’m impressed with what those folks managed to score. There were some simple things that helped them do so, the major one being the monitoring tools.

Back in the old days, monitoring dashboards were part of the product deliverables, and the development team was usually asked to deliver them. And usually they did deliver: extremely comprehensive dashboards with tons of information, checking every aspect of the service the product was offering, fine-detailed graphs and gauges. Sounds nice, doesn’t it? In theory it was, but that’s one of the things we decided to trial, in the form of Wheel-of-Misfortune training (thanks Google!). Surprisingly, it turned out that the dashboards were so complex it was actually easier to check application logs than to have a quick look at the monitoring. On top of that, the team was ending up with many alarms firing almost all of the time. The result was easy to predict – support engineers started to ignore alerts. “Ah, that one? Nah, there’s no reason to be worried about that, it’s always red”.

And that’s when we decided to hand over the dashboard creation and maintenance task to SREs, assuming that no other part of the company knows production better than the people who actually run it. That change also had a side effect: the SREs preparing the dashboards usually felt the urge to fully understand the product they were dealing with, asking tons of questions to the dev guys, learning how the service behaves and what it truly does. The result was a bit of a surprise: we moved from extremely detailed dashboards to much simpler, sometimes even crude ones, focusing on the service that is being delivered, in the dimensions of the four golden signals wherever possible, and RED where not. We of course keep the detailed ones as well, but the main focus moved to the service dashboard. In the end, running low on disk space on one VM is less important than the whole service being down, and the latter is the one you should keep an eye on in the first place.
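To make that “service first” idea a bit more concrete, here is a minimal Python sketch of what a RED-style service summary boils down to – one rate, one error ratio, one latency figure per service instead of dozens of per-host gauges. The `Request` shape and the sample numbers are hypothetical, made up purely for illustration; this is not our actual tooling.

```python
# A hypothetical, minimal example of a RED-style service summary (Rate, Errors,
# Duration). The Request shape and sample data are made up for illustration.
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    timestamp: float     # seconds since the start of the observation window
    duration_ms: float   # how long the request took
    status: int          # HTTP-style status code


def red_summary(requests: list[Request], window_s: float) -> dict:
    """Reduce raw requests to the few numbers a service dashboard should lead with."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    q = quantiles(durations, n=100)      # 99 cut points; index 94 ~ 95th percentile
    return {
        "rate_rps": total / window_s,    # Rate: how much traffic the service handles
        "error_ratio": errors / total,   # Errors: share of failed requests
        "p95_ms": q[94],                 # Duration: the latency the customer feels
    }


if __name__ == "__main__":
    # 300 fake requests over a 5-minute window, one failing every 50 requests.
    sample = [Request(t, 120 + (t % 7) * 30, 500 if t % 50 == 0 else 200)
              for t in range(300)]
    print(red_summary(sample, window_s=300.0))
```

Whether you feed numbers like these into Grafana, a home-grown page, or a plain text report matters far less than the fact that the first thing on the screen describes the service, not the hosts underneath it.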

SRE Core Team – who takes responsibility for SLA?

Similar to the SRE Monitoring Team, the Core part was established with a couple of key points in mind, and I will go through the ones that had the biggest impact on the performance of the new team.

The assumption was that while the Monitoring Team would cover the L1 responsibilities, L2 would be covered by the Core Team, working in 8/5 mode and doing on-calls. When we started with SRE, our primary source of engineers was the former L2 support team, focused on a single business domain. One question that draws attention here is “why did you expect SRE to behave differently than L2?”.

I think I have a good answer to that question, starting with a single statement, which was the main theme for the team: “SRE owns the SLA exclusively”. You may ask why it’s so important to have a single team responsible for the SLA of the product. In our case, back in the good old days, the SLA was something that was of interest to many people across the organization, with no one really taking full responsibility. That meant there was no single place where the Ops team could get an answer to simple questions like “Is that time window good for rolling out that change?”. There was little understanding that those SLAs really do mean something.
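If it helps to see why “those SLAs really do mean something”, here is a hedged back-of-the-envelope sketch in Python – the 99.9% figure, the 30-day period, and the function names are assumptions made for the example, not our actual contract terms – showing how an SLA target translates directly into a downtime budget that every incident and every risky change spends.

```python
# A back-of-the-envelope illustration of what an SLA target means in practice.
# The 99.9% figure and the 30-day period are assumptions made for the example.
def downtime_budget_minutes(sla_target: float, period_days: int = 30) -> float:
    """Minutes of downtime the SLA allows over the given period."""
    return period_days * 24 * 60 * (1.0 - sla_target)


def budget_left(sla_target: float, downtime_so_far_min: float,
                period_days: int = 30) -> float:
    """Remaining error budget; once it is gone, the team owning the SLA has a
    hard, numeric argument for rejecting or postponing risky changes."""
    return downtime_budget_minutes(sla_target, period_days) - downtime_so_far_min


if __name__ == "__main__":
    # A 99.9% monthly target allows roughly 43 minutes of downtime...
    print(round(downtime_budget_minutes(0.999), 1))                 # -> 43.2
    # ...so a single 30-minute outage leaves very little room for anything else.
    print(round(budget_left(0.999, downtime_so_far_min=30.0), 1))   # -> 13.2
```

A single team owning that number is what turns “is this time window good for the change?” from a matter of opinion into a question with an answer.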

New mindset

With the establishment of the SRE team, we decided that whoever gets woken up in the middle of the night to recover the system is the one who should hold both the power and the responsibility to allow or reject a change. Who else would know better the effect of a bad decision to roll out a change? Who else would be more impacted by a failed hardware upgrade? Who else would be more interested in having canary deployment capabilities in place?

“You forgot to mention that small little update to the properties file? That’s nice, no problem, but it’s us who will be called at 4 a.m. when things go bananas – kindly don’t forget about that next time – and we’ll keep an eye out to ensure you don’t.”

SREs, as the holders of the production system, became the final quality gate – if they are not convinced it’s a good idea to run a change, that change stays “on hold” until the requestor delivers enough proof to change their assessment.

Putting that responsibility on the SRE team also has an additional advantage. If you have ever faced an outage yourself, you may have witnessed an interesting phenomenon I call SEP – Somebody Else’s Problem. “It’s not our fault the HW went down; HW Ops team, please fix that”. I’ve seen many times how a case of SEP can prolong service recovery by hours, turning the recovery process into a game of ping-pong blaming.

In our world, it’s not really important who or what caused an outage – it’s always the SRE team’s responsibility to recover the situation, and if the original root cause lies within some other team’s responsibility, to drive the resolution, ensuring all of the right people are lined up and ready to help.

In the end, we’ll be the first ones to get woken up in the middle of the night, and that’s what matters. And that leads us to another buzzword: “If you run it, you deploy it”.

The art of breaking things

It’s not uncommon in support organizations to have a split between the people who support the product and the people who deploy it. It may look attractive to have that kind of setup in place, with a clean split of responsibilities between both parties, a nicely defined set of tools each of them uses, etc. We did spot interesting problems with that, though. Our principle is to learn the product hands-on, and I utterly disagree with anyone who claims you can provide service support on the basis of documentation alone. Human beings learn by playing with things and eventually breaking them. Having the split I’ve mentioned before in place pretty much meant that only a single team was granted the pleasure of breaking things – the deployment teams. And we had a limited number of ways of passing that experience on to the support teams, so we decided it was something worth changing. Our SREs are not only responsible for providing support; they also do the deployments on both production and staging environments, getting to know the product and its updates before they end up in production.

We find that extremely useful, limiting the ping-pong between the teams and making SREs fully aware of what a change really means. They’ve already had the occasion to play with it, and break it, before it reaches production. Breaking things and learning in the process carries one major advantage for those willing to tap into it. Support organizations carry a tremendous amount of experience in what a well-designed product really is. It’s the support teams that are exposed both to systems that are rock-solid, staying up and running for years, and to systems you can take down by clicking on the wrong part of the GUI.

The worst thing you can possibly do is to leave that experience unused when you start to work on a new product or feature. Because of that, we came up with the idea of Product Design Reviews. Whenever the development of a new product or major feature starts, an SRE representative is asked to go through the product architecture and the metrics the product exposes, ensuring we have everything needed to make the product rock-solid. We typically throw in some difficult questions too, to challenge the architecture decisions that were made and to ensure they are solidly backed by business or technical needs. We all love to play with new technology puzzles, but the important thing is not to put them in just for the sake of the puzzle being “new and shiny”.

So, you may think all it takes to form an SRE team is to change the job titles of the L1 and L2 guys, and you’re done.
