Job Description
Are you an expert in distributed systems and automation? NexusScale Systems is looking for a Senior Site Reliability Engineer to help us build and scale the backbone of our cloud infrastructure. In this role, you will bridge the gap between development and operations, ensuring our high-traffic platforms remain resilient, performant, and secure.
You will work alongside elite engineering teams to implement Infrastructure as Code (IaC), optimize cloud costs, and lead incident response strategies for our global user base.
Responsibilities
- Design and maintain highly available, fault-tolerant production infrastructure on AWS/GCP.
- Automate manual operational tasks using Python, Go, or Bash to increase system efficiency.
- Lead post-mortem analysis and incident management for complex service disruptions.
- Optimize CI/CD pipelines to ensure seamless, reliable deployment cycles.
- Collaborate with engineering teams to set and maintain rigorous SLOs and SLAs.
- Manage capacity planning and resource allocation to ensure optimal performance under load.
- Implement robust monitoring, alerting, and observability solutions using tools like Prometheus and Grafana.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 5+ years of experience in SRE, DevOps, or large-scale Systems Engineering.
- Deep expertise in container orchestration with Kubernetes and Docker.
- Strong proficiency in Infrastructure as Code (Terraform or CloudFormation).
- Advanced knowledge of cloud platforms (AWS, GCP, or Azure).
- Proven experience with observability tools (Datadog, Prometheus, Splunk).
- Strong communication skills and a passion for fostering a DevOps culture.