Job Description
Are you obsessed with uptime, scalability, and building robust, distributed systems? CloudScale Innovations is seeking a visionary Senior Site Reliability Engineer to join our high-performance infrastructure team. You will be the bridge between development and operations, ensuring our global platforms remain resilient, performant, and secure.
In this role, you will define the standards for reliability, drive automation initiatives, and tackle complex architectural challenges in a mission-critical cloud environment.
Responsibilities
- Design, implement, and maintain highly available, scalable, and secure cloud infrastructure.
- Automate manual operational tasks using Infrastructure as Code (IaC) tools like Terraform and Pulumi.
- Lead incident response efforts, conduct blameless post-mortems, and identify the root causes of system failures.
- Optimize system performance, resource utilization, and operational costs across multi-cloud environments.
- Develop and maintain comprehensive monitoring, alerting, and observability solutions (Prometheus, Grafana, Datadog).
- Mentor junior engineers and promote a culture of operational excellence across the engineering organization.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Systems Architecture.
- Expert-level proficiency in public cloud platforms (AWS, GCP, or Azure).
- Strong scripting and automation skills in Python, Go, or Ruby.
- Deep understanding of container orchestration platforms, specifically Kubernetes.
- Solid grasp of CI/CD pipelines and deployment strategies (canary, blue/green).
- Proven experience with distributed systems and large-scale data storage solutions.
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.