Job Description
Are you obsessed with system availability, latency, and performance? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between development and operations, building robust automated systems that ensure our global cloud platform stays resilient and scalable.
You will work with a world-class team of engineers to minimize manual toil and maximize the reliability of our distributed services. If you are a problem-solver who thrives in a high-stakes, fast-paced environment, we want to hear from you.
Responsibilities
- Design and maintain highly available, scalable, and secure infrastructure on public cloud providers (AWS/GCP).
- Implement Infrastructure as Code (IaC) solutions using Terraform and Pulumi.
- Develop and manage CI/CD pipelines to increase deployment velocity and reliability.
- Optimize system monitoring and observability frameworks using Prometheus, Grafana, and the ELK stack.
- Conduct deep-dive incident post-mortems and implement permanent fixes for recurring issues.
- Collaborate with SDE teams to improve system architecture for high-performance applications.
- Ensure compliance with security best practices across all production environments.
Qualifications
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Expertise in managing distributed systems at scale in a production environment.
- Proficiency in at least one modern programming language (Go, Python, or Java).
- In-depth knowledge of Docker, Kubernetes, and container orchestration in general.
- Strong background in Linux/Unix system administration and networking (TCP/IP, DNS, Load Balancing).
- Experience with cloud-native monitoring and distributed tracing tools.
- Proven ability to troubleshoot complex, multi-layered system failures under pressure.