Job Description
Are you obsessed with uptime, scalability, and distributed systems? NexusCloud Systems is looking for an elite Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will be the bridge between development and operations, ensuring our global, high-traffic platforms remain performant and resilient under heavy load. You will design, build, and maintain the systems that power our cloud-native environment, driving automation and reliability across the board.
Responsibilities
- Architect and maintain highly available, scalable cloud infrastructure on AWS.
- Automate manual operational tasks using Infrastructure as Code (Terraform, Pulumi).
- Lead incident response and perform deep-dive blameless post-mortems for production issues.
- Optimize system performance, latency, and resource utilization through rigorous monitoring.
- Develop and manage CI/CD pipelines to facilitate rapid, safe deployment cycles.
- Collaborate with product engineering teams to influence design choices for reliability.
- Implement robust security measures and compliance protocols across our infrastructure.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering at a high-growth tech company.
- Expertise in cloud infrastructure (AWS preferred) and container orchestration (Kubernetes).
- Strong proficiency in Go, Python, or Ruby for automation and tool development.
- Deep understanding of observability tools like Prometheus, Grafana, Datadog, or New Relic.
- Solid grasp of networking fundamentals (DNS, load balancing, TLS, HTTP/HTTPS).
- Proven experience with configuration management (Ansible, Chef) and CI/CD tools.
- Willingness to participate in an on-call rotation that supports a 99.99% system availability target.