Job Description
Are you obsessed with system uptime, performance at scale, and automating the mundane? CloudScale Systems is looking for a Senior Site Reliability Engineer to join our core infrastructure team in San Francisco. You will play a pivotal role in designing, building, and maintaining our global cloud-native architecture. We value engineers who view operations as a software problem and thrive in high-stakes environments.
Responsibilities
- Architect and maintain highly available, scalable cloud infrastructure on AWS/Kubernetes.
- Implement Infrastructure as Code (IaC) using Terraform to ensure environment consistency.
- Develop and automate CI/CD pipelines to increase deployment velocity.
- Lead incident response and perform deep-dive post-mortem analyses to prevent recurrence.
- Optimize system performance and resource utilization to manage cloud infrastructure costs.
- Define and implement Service Level Objectives (SLOs) and Error Budgets.
- Mentor junior engineers on best practices for observability and system reliability.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Expertise in running containerized workloads in production with Docker and Kubernetes.
- Strong proficiency in scripting/coding (Python, Go, or Bash).
- Deep understanding of cloud networking, security, and storage on AWS.
- Demonstrated experience with monitoring and observability stacks (Prometheus, Grafana, ELK, or Datadog).
- Strong analytical skills with a proactive approach to troubleshooting complex distributed systems.
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.