Job Description
Are you obsessed with system uptime, performance at scale, and the elegance of automated infrastructure? CloudScale Innovations is seeking a high-impact Senior Site Reliability Engineer to join our core platform team. You will be instrumental in architecting the next generation of our global cloud infrastructure, ensuring 99.999% availability for our mission-critical SaaS products.
You will work at the intersection of software engineering and systems operations, bridging the gap between development and production in a high-velocity setting.
Responsibilities
- Design, build, and maintain scalable, highly available distributed systems on AWS.
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Champion SRE best practices, including SLO/SLI definition, error budgets, and blameless post-mortems.
- Drive incident response and conduct deep-dive root cause analysis (RCA) for complex production issues.
- Optimize system performance and resource utilization to ensure cost-efficiency at scale.
- Collaborate with DevOps and Software Engineering teams to improve CI/CD pipelines and deployment safety.
- Mentor junior engineers and foster a culture of engineering excellence.
Qualifications
- 5+ years of experience in SRE, DevOps, or large-scale Systems Engineering roles.
- Expertise in cloud infrastructure (AWS/GCP) and container orchestration platforms (Kubernetes).
- Strong proficiency in programming languages such as Go, Python, or Ruby.
- In-depth knowledge of Linux internals, networking, and security protocols.
- Proven experience with Infrastructure as Code (Terraform, CloudFormation).
- Experience with observability and monitoring tools such as Prometheus, Grafana, and Datadog.
- Excellent problem-solving skills and the ability to thrive in a distributed, remote-friendly team environment.