Job Description
Are you obsessed with system uptime and scalability? Join CloudScale Innovations, where we are building the next generation of high-availability cloud infrastructure. We are seeking a Senior Site Reliability Engineer to help us bridge the gap between development and operations, ensuring our global services remain resilient, performant, and secure.
You will play a pivotal role in automating infrastructure, optimizing CI/CD pipelines, and driving a culture of blameless post-mortems.
Responsibilities
- Design, implement, and maintain highly available and scalable cloud infrastructure.
- Automate manual operational tasks using Infrastructure as Code (Terraform, Ansible).
- Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
- Optimize cloud costs and performance through rigorous monitoring and capacity planning.
- Collaborate with cross-functional engineering teams to integrate reliability best practices into the SDLC.
- Participate in an on-call rotation to help maintain our 99.99% service availability target.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Deep expertise in AWS or GCP cloud services and Kubernetes orchestration.
- Proficiency in programming with Python, Go, or Ruby for infrastructure automation.
- Strong grasp of Linux internals, networking, and distributed systems architecture.
- Experience with observability tools such as Prometheus, Grafana, and Datadog.
- Excellent communication skills and the ability to thrive in a fast-paced, collaborative environment.