Job Description
Are you passionate about building resilient, high-scale distributed systems? CloudScale Systems is looking for a Senior Site Reliability Engineer (SRE) to join our infrastructure team in San Francisco. You will play a critical role in designing, implementing, and maintaining the reliability, scalability, and performance of our global cloud architecture.
We value engineers who view operations as a software engineering problem. If you thrive on automation, observability, and solving complex production incidents, we want to hear from you.
Responsibilities
- Architect and maintain highly available, scalable cloud infrastructure on AWS.
- Automate operational tasks using Infrastructure as Code (Terraform, Ansible) to minimize manual intervention.
- Drive incident management and perform root cause analysis (RCA) to improve system reliability.
- Optimize system performance, latency, and resource utilization across our microservices architecture.
- Develop and manage observability stacks including Prometheus, Grafana, and ELK.
- Collaborate with engineering teams to promote SRE best practices and error budget management.
- Participate in an on-call rotation to ensure 99.99% service uptime.
Qualifications
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Strong proficiency in Linux system administration and networking fundamentals (TCP/IP, DNS, load balancing).
- Advanced experience with container orchestration using Kubernetes.
- Expert-level coding skills in Go, Python, or Ruby.
- Deep understanding of AWS services (EKS, RDS, Aurora, VPC).
- Solid experience with CI/CD pipelines (GitHub Actions, Jenkins, or similar).
- Excellent problem-solving skills and the ability to operate under pressure.