Senior Site Reliability Engineer, Database Operations
GitLab is seeking a Senior Site Reliability Engineer (SRE) to join our Database Operations team. As an SRE, you will be responsible for maintaining and optimizing our database infrastructure, ensuring high availability and performance for our enterprise-scale workloads. You will work with technologies like ClickHouse and PostgreSQL, and leverage automation tools such as Ansible, Terraform, and Kubernetes.
Responsibilities:
- Design, build, and maintain ClickHouse and PostgreSQL clusters.
- Provision and orchestrate cloud infrastructure in GCP using configuration management and IaC tools.
- Design and implement high-availability ClickHouse solutions.
- Optimize and scale high-transaction PostgreSQL clusters.
- Build and maintain monitoring and alerting tools (Prometheus/Grafana).
- Enable cross-database integrations and workflows.
- Respond to platform alerts and user emergencies.
- Enhance infrastructure security.
- Collaborate with engineering teams.
Mandatory Technical Skills:
- Advanced database platform management experience (PostgreSQL and ClickHouse).
- Advanced cloud infrastructure automation and management experience (Ansible, Terraform, Kubernetes).
- Experience with Go, Ruby, or Python.
- Advanced experience with Linux.
- Extensive on-call experience.
- Solid incident management experience.
- Experience implementing monitoring at scale (Prometheus and Grafana).
Mandatory Non-Technical Skills:
- Strong communication skills.
- Ability to work under pressure.
- Comfortable working asynchronously.
- Commitment to promoting GitLab's CREDIT values.