Gartner says site reliability engineering (SRE) enables organizations to fulfil their reliability needs at scale to support the demands of digital business.
Site reliability engineering (SRE) promises methods to improve resiliency in organizations while they pursue agility in digital transformation.
Gartner says organisations are exploring site reliability engineering (SRE) as a way to balance reliability and change velocity with modern microservices and multi-cloud-based architectures. I&O technical professionals can use this research as an extensive assessment of SRE principles.
However, many I&O leaders remain unfamiliar with SRE concepts, making it difficult to implement SRE methods or hire or train for SRE roles.
Another way of looking at SRE
Wes Hummel, vice president of site reliability and cloud engineering at PayPal, says that historically, companies typically managed their infrastructure and applications through teams called operations.
Operations teams monitored the site and watched how production was performing, but were often very reactive in dealing with issues, problems and gaps in resiliency. Those teams would find a way to get the site back to a healthy state but did not necessarily pursue the systemic root causes behind those problems.
“The analogy I like to use is, operations folks will often put their fingers in the dam when there's a leak; site reliability engineers will figure out a way to build a new dam so it doesn't leak,” he added.
How SRE operates inside PayPal
Using a high-performance car as an analogy, he explained that the cloud engineering part of PayPal is the team that “builds that high performance engine that runs that car and ensures that we are as fast as we can be and that everything is working very well.
“And the site reliability portion of that are the people that are watching all of the gauges and making sure that the speed is keeping at the levels we need them at.
“They are figuring out how do we make the tweaks; how do we make the changes; how do we make the systemic fixes that we need to ensure that the engine continues to perform and that we can go faster and faster and get our customers where they need to go,” elaborated Hummel.
How to cut back on the stress
The digital economy has meant that businesses operate nearly 24 hours a day, seven days a week. In the B2C space, this brings a greater expectation from customers that vendors respond quickly and in real time. This means that failures and glitches need to be identified and resolved quickly.
Hummel acknowledged that it can be a stressful job being part of the SRE team but there are ways to minimise these situations.
“But the way to get there is to really understand the architecture of the underlying system – to have folks who have that engineering mindset to understand how the network connections are created and the architecture behind that, how our computer works, how our data works.”
He hinted that it's not just about fixing issues but also about doing post-mortem reports to understand what went wrong, how it went wrong, why it went wrong and how the system can be architected better, including faster monitoring, alerting and telemetry that identify problems even before customers see them.
“The way to get the stress out is to continue to make the system more and more resilient, so that we don't have the issues,” he opined.
Designing a technology-agnostic, resilient architecture
How do you design and sustain an architecture that delivers the reliability of on-premises infrastructure, the performance of the edge, and the availability of the cloud?
Hummel added that complicating matters is the need to modernise an existing architecture that continues to evolve over time.
He noted that this modernisation gets even more complicated when an organisation grows as much through acquisition as it does through organic development.
“It really takes an architecture-first view – how do we plan for the future; how do we satisfy the customers we need now; and how do we ensure that whatever we are building has that scalability and can be in any region or anywhere that our customer is going to be,” he continued.
Click on the Podchat player and listen to the candid discussion with Hummel on how SRE answers the operational resiliency aspirations of enterprises post-COVID-19.