Job Description
Are you obsessed with system uptime, latency, and scalable architecture? CloudScale Systems is seeking a visionary Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between development and operations, building robust automation that keeps our global platform resilient, performant, and secure.
You will play a pivotal role in shaping our cloud-native strategy, mentoring a team of high-performing engineers, and solving complex distributed-systems challenges with cutting-edge technologies.
Responsibilities
- Design, implement, and maintain highly available, distributed cloud infrastructure.
- Automate operational tasks using Go, Python, or Ruby to eliminate manual toil.
- Lead incident response efforts and conduct blameless post-mortems to improve system resilience.
- Optimize cloud resource utilization and cost-efficiency across AWS and Kubernetes environments.
- Collaborate with engineering squads to improve CI/CD pipelines and deployment strategies.
- Define and track Service Level Objectives (SLOs) and error budgets to ensure system stability.
- Mentor junior engineers and advocate for SRE best practices across the organization.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent professional experience.
- 5+ years of experience in SRE, DevOps, or large-scale systems engineering roles.
- Expert-level proficiency in Kubernetes, Docker, and cloud-native service orchestration.
- Deep experience with public cloud platforms (AWS, GCP, or Azure).
- Strong proficiency in infrastructure-as-code tools such as Terraform or Pulumi.
- Advanced knowledge of observability stacks (Prometheus, Grafana, Datadog, or Honeycomb).
- Strong debugging skills across the full stack, from network protocols to application code.