Job Description
Are you obsessed with uptime, system performance, and building resilient infrastructure? CloudScale Innovations is seeking a Senior Site Reliability Engineer to join our core platform team. You will be the architect of our reliability strategy, ensuring that our globally distributed cloud services remain stable, scalable, and secure. We are looking for an engineer who thrives at the intersection of software development and systems operations.
You will play a pivotal role in evolving our infrastructure-as-code practices and automating the operational lifecycle of our mission-critical applications.
Responsibilities
- Architect and maintain highly available, scalable cloud infrastructure using Terraform and Kubernetes.
- Lead incident response efforts and conduct blameless post-mortems to improve system resilience.
- Implement observability and monitoring solutions to ensure deep visibility into system performance (Prometheus, Grafana, ELK).
- Automate manual operational tasks through robust CI/CD pipelines and scripting (Python/Go).
- Collaborate with product and engineering teams to define and meet ambitious Service Level Objectives (SLOs).
- Mentor junior engineers and champion SRE best practices across the organization.
- Optimize cloud resource utilization to balance performance with cost efficiency.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Deep expertise in managing large-scale Kubernetes clusters in production (EKS/GKE).
- Proficiency in at least one modern programming language (Go, Python, or Java).
- Expert-level knowledge of AWS or GCP cloud services and networking fundamentals.
- Strong background in IaC (Terraform, CloudFormation, or Pulumi).
- Proven ability to troubleshoot complex distributed systems in high-traffic environments.