Job Description
Are you obsessed with system uptime, latency, and automated recovery? CloudScale Systems is looking for an elite Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will be the bridge between development and operations, ensuring our high-traffic global platforms remain scalable, resilient, and performant.
We value engineers who view operations as a software engineering problem. You will work on bleeding-edge infrastructure, contributing to our Kubernetes-native ecosystem while mentoring a high-performing team of developers.
Responsibilities
- Design, build, and maintain highly available, distributed cloud infrastructure.
- Automate operational tasks through CI/CD pipelines and infrastructure-as-code (Terraform/Pulumi).
- Define and implement Service Level Objectives (SLOs) and Error Budgets.
- Lead incident response for critical production issues and conduct detailed, blameless post-mortems.
- Optimize system performance, latency, and resource utilization across our multi-region footprint.
- Collaborate with software engineering squads to improve service reliability during the design phase.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Expert-level proficiency in cloud-native platforms (AWS, GCP, or Azure).
- Deep hands-on experience with Kubernetes, Helm, and service mesh technologies (Istio/Linkerd).
- Strong proficiency in at least one high-level language such as Go, Python, or Ruby.
- Solid understanding of observability tools like Prometheus, Grafana, Datadog, or Honeycomb.
- Proven expertise in managing distributed databases and caching layers (Redis, PostgreSQL, Cassandra).
- Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience.