Job Description
Are you obsessed with uptime, performance, and building resilient systems at scale? NexusScale is looking for a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between development and operations, ensuring our mission-critical services remain highly available and performant for millions of users worldwide.
You will work in a high-impact, cloud-native environment where you'll have the autonomy to architect solutions that define the future of our platform.
Responsibilities
- Design, implement, and maintain highly available distributed systems on AWS and GCP.
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Champion SRE best practices, including error budgets, incident response, and post-mortems.
- Scale our container orchestration platform (Kubernetes) to support rapid service growth.
- Improve system observability through advanced logging, monitoring, and tracing stacks (Datadog, Prometheus, Grafana).
- Collaborate with cross-functional engineering teams to optimize application performance and latency.
- Manage capacity planning and perform proactive performance tuning for critical backend services.
Qualifications
- 5+ years of experience in SRE, DevOps, or high-scale systems engineering.
- Deep proficiency in Kubernetes, including cluster management and troubleshooting.
- Strong development skills in Go, Python, or Java, with a focus on writing maintainable automation.
- Extensive experience with Infrastructure as Code (IaC) tooling, specifically Terraform.
- Proven expertise in managing large-scale production environments on a major cloud provider (AWS, GCP, or Azure).
- Strong understanding of Linux internals, networking, and security best practices.
- Ability to participate in an on-call rotation and lead complex incident resolution processes.