A Site Reliability Engineer is focused on continuous service and performance improvement around an ecosystem of applications. This entails observing and evaluating the end-to-end performance of applications in production and identifying and resolving issues through increased observability, data insights and automation. The role also involves the management and oversight of application releases through continuous integration and deployment pipelines into production as well as ensuring adherence to operational design principles and production controls.
The role requires collaboration with development, testing, application support and infrastructure teams, along with an agile mindset and approach to working. Your mission will be to drive operational support overhead to a minimum whilst maximizing stability and availability, all in service of achieving business outcomes and value for our clients.
You will be a thought leader who develops the culture of Operations Engineering within the TOC and SRE within the wider Chief Digital and Information Office (CDIO) community, and you will have practitioner experience with both SRE and DevOps principles and frameworks. You will have a deep understanding of the operational challenges in maintaining and managing complex enterprise applications that span the full stack of business capabilities in a large enterprise and how they integrate into the ecosystem.
You will be someone who enjoys working in a team environment and has a passion for continuous learning and a desire to improve and share your knowledge across the Technology community.
- You will work in close partnership with development teams, ensuring sustainable and reusable design patterns that support efficient production operation. This will typically require participation in hybrid pods that include development and business colleagues.
- As a participant in agile development pods you will have visibility into the build and deployment pipeline around an application set, ensuring that release criteria are being met and the integrity of the production environment is maintained.
- You will understand the SDLC and be able to design, implement and automate processes to defend the integrity, security, and auditability of key risk controls.
- You will ensure the appropriate levels of instrumentation are implemented across an ecosystem of applications. You will have in-depth knowledge of the monitoring tool landscape, particularly APM toolsets, and the ability to interpret and analyze metric data to proactively identify and address potential system issues.
- A primary focus will be to reduce Toil (the kind of work tied to running a service that tends to be manual, repetitive, automatable, tactical and devoid of value). Consequently, you will have a high level of scripting or programming capability that will serve as a foundation for using enterprise standard automation and RPA platforms to enable the front-to-back automated resolution of incidents, request fulfilment and process interactions.
- As an SRE you'll be required to practice key ITIL operational processes and develop incident post-mortems on services, focused on improving the availability, scalability, performance, security and efficiency of internal or client-facing services.
Who are we looking for?
- You will have previous development experience and a deep understanding of DevOps principles and culture.
- Have expertise in the enterprise tooling and service management space: monitoring, event management, robotics and automation.
- Ability to identify improvements to the end user's experience and translate them into tangible outcomes that you are capable of driving to conclusion.
- In-depth understanding of how applications integrate across a complex enterprise including Cloud native, Hybrid and On-Prem environments.
- Knowledge or experience of defining Error Budgets, Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Have deep knowledge and practical experience of the monitoring stack and of popular open source and third-party toolsets that provide visibility into application performance and analytic telemetry fed by system metrics and machine log data.
- Hands-on experience and technical knowledge of cloud concepts (Azure, AWS), API frameworks and container technologies.
- Strong collaboration skills and a demonstrable ability to juggle multiple demands, deliverables and priorities, with the capability to self-drive execution of both individually owned and team-owned activities.
- The role requires excellent communication skills to ensure clear and concise dialogue globally with peers and stakeholders.
- You have a passion to expose areas that require improvement to ensure we maximize efficiency, system stability and availability.
- Have a strong character and be comfortable challenging others where upholding our key principles is required to achieve business outcomes and value for our clients.