Engineering Manager, SRE - Observability

As an Engineering Manager specializing in Observability, you will lead and scale a highly skilled team responsible for architecting, building, and evolving enterprise-grade monitoring, alerting, and incident response systems. Leveraging your deep expertise with observability tools such as Datadog, Grafana, Loki, and others, you will drive our transformation from reactive firefighting to proactive reliability engineering at scale. Your mission is to empower engineering teams by providing the right visibility and tooling to ensure system health, availability, and performance.

You will collaborate closely with Product Management and Technical Leads to define and execute a strategic roadmap that addresses the challenges of monitoring complex, large-scale distributed systems in a cloud-native environment. This role demands a hands-on engineering leader who understands the nuances of telemetry data, visualization, alerting reliability, and cost-efficient observability architectures in enterprise settings.

What You’ll Be Doing

  • Recruit, mentor, and retain top engineering talent specialized in observability and reliability engineering.
  • Directly contribute to the design and implementation of observability solutions alongside your team, maintaining a high bar for technical excellence.
  • Own and evolve the end-to-end observability stack and operational processes, including metrics, traces, logs, dashboards, and alerting.
  • Partner with SRE, DevOps, and platform teams to integrate and extend observability tooling across diverse services running at large scale.
  • Lead roadmap planning for observability infrastructure and tooling in partnership with Product and Engineering leadership.
  • Establish best practices for instrumentation, data collection, alerting thresholds, and incident response workflows to elevate the organization's reliability posture.
  • Identify gaps and weaknesses in monitoring coverage and performance; proactively drive improvements and automation.
  • Collaborate cross-functionally with teams across the enterprise to influence observability adoption, standardization, and innovation.
  • Foster a culture of continuous learning, high team engagement, and technical craftsmanship within your team.
  • Communicate technical strategy, progress, risks, and impact effectively with stakeholders at all levels.

Our Tech Environment

  • Primarily AWS cloud infrastructure with Kubernetes orchestration.
  • Codebase spans Ruby, Go, and Python.
  • Data storage includes AWS Aurora (MySQL), S3, and Kafka streaming.
  • Observability responsibilities include balancing operational maintenance, tooling innovation, and incident support.

The intelligent heart of customer experience

Zendesk software was built to bring a sense of calm to the chaotic world of customer service. Today we power billions of conversations with brands you know and love.

Zendesk believes in offering our people a fulfilling and inclusive experience. Our hybrid way of working enables us to purposefully come together in person, at one of our many Zendesk offices around the world, to connect, collaborate, and learn, whilst also giving our people the flexibility to work remotely for part of the week.

This role requires attending our local office for part of the week. The specific in-office schedule is to be determined by the hiring manager.

The Poland annualized base salary range for this position is zł297,000.00-zł445,000.00. Please note that while the salary range represents the minimum and maximum base salary rate for this position, the actual compensation offered will be based on job-related capabilities, applicable experience, and other relevant factors. This position may also be eligible for bonus, benefits, or related incentives that will be communicated during the offer stage.

Hybrid: In this role, our hybrid experience is designed at the team level to give you a rich onsite experience packed with connection, collaboration, learning, and celebration - while also giving you flexibility to work remotely for part of the week.

What You Bring to the Role

  • Deep hands-on experience with commercial and open-source observability tools, including Datadog, Grafana, Loki, and related telemetry technologies.
  • Proven track record managing observability or SRE teams within large, complex enterprise environments.
  • Strong understanding of distributed systems, cloud-native architectures (Kubernetes, AWS), and how observability fits into scalable operations.
  • Ability to provide technical leadership while actively contributing to engineering solutions and troubleshooting.
  • Expertise in designing scalable, reliable telemetry pipelines and intelligent alerting to reduce alert noise and incident toil.
  • Demonstrated skill in building and improving observability platforms that serve multiple engineering teams and business units.
  • Effective communicator and collaborator, able to bridge engineering, product, and business stakeholders.
  • Commitment to developing team members through coaching, feedback, and career growth opportunities.
  • Experience driving cultural change in organizations towards proactive reliability engineering and data-driven decision making.

Required

  • 3+ years of people management experience leading engineering teams.
  • Deep domain expertise in observability, with hands-on experience in tools such as Datadog, Grafana, and Loki.
  • Significant experience working in or managing engineering teams within large-scale enterprise companies.
  • Proven ability to hire, mentor, and retain high-performing engineers.
  • Strong collaboration skills to influence cross-functional teams in large engineering organizations.
  • Experience with distributed systems and cloud environments (AWS, Kubernetes).

Preferred

  • Background leading observability-focused teams.
  • Hands-on experience operating telemetry systems for large-scale Kubernetes and AWS workloads.
  • Passion for innovation, continuous learning, and championing a growth mindset.
  • Experience managing geographically distributed teams.