The Site Reliability Engineer (SRE) is a hybrid engineering role responsible for all aspects of application production support, deployment and monitoring, as well as the development of tools to support these activities. The SRE supports mission-critical applications and ensures that the highest levels of availability, performance and stability are maintained. Crucially, the SRE should also design and build tools and solutions that automate as many aspects of support as possible, so that mundane manual support activity is reduced or eliminated and systems can scale readily without a matching linear increase in support overhead. The SRE role requires exceptional development and engineering experience and the ability to apply that knowledge to the complex problem of running applications reliably at scale.
The Site Reliability Engineer should spend a considerable portion of their time developing new tools to solve existing problems in the production environment. Where possible, these tools should be designed with reuse across the wider organization in mind.
The Site Reliability Engineer also defends our business and reputation by ensuring the integrity and security of our production platforms at all times.
- Partner with service transition managers, development leads and architects to ensure that the designs of new applications meet expected site reliability standards. Ensure non-functional production support requirements are considered early in the lifecycle of all new applications.
- drive the automation agenda across the team, in partnership with the automation group
- ensure workflows, processes, tooling and applications are of the highest quality standard
- support the Service & Product Manager across several technical domains
- contribute expertise to the management of existing and new IT products and services
- define workarounds for known errors and initiate process improvements
- maintain a knowledge database
- assist in planning BCP, patching and other infrastructure related projects
- contribute to automating manual processes
- get involved in problem management to identify root causes and ensure issues don't recur
Who we're looking for
- ideally 5+ years of hands-on experience with Unix, SQL and MS Office products
- Proficient with tools such as ServiceNow, Confluence, JIRA and batch scheduling tools like Autosys
- Hands-on experience with, and technical knowledge of, cloud concepts
- Expertise in the enterprise tooling and service management space - monitoring, event management, robotics and automation
- Azure certification (AZ-900 and above)
- Ability to solve complex issues, with strong problem statement analysis and solution design skills
- Track record of building good relationships with senior IT stakeholders and business partners
- confident communicator who can explain technology to non-technical audiences
- capable of understanding client needs and translating them into products and services
- able to work under pressure and manage multiple, concurrent and conflicting priorities to deadlines
- A good understanding of Postgres is an added advantage