Job Description
Are you obsessed with uptime, scalability, and system performance? NexusCloud Solutions is seeking a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between development and operations, ensuring our high-traffic global platforms remain resilient, performant, and secure. You will define the future of our cloud-native architecture.
Responsibilities
- Design and maintain highly available, distributed cloud infrastructure on AWS/GCP.
- Automate operational tasks using Infrastructure as Code (Terraform, Ansible).
- Lead incident response, root cause analysis, and post-mortem investigations.
- Optimize CI/CD pipelines to improve deployment velocity and reliability.
- Implement advanced observability, monitoring, and alerting strategies using Prometheus and Grafana.
- Collaborate with engineering teams to improve system architecture and fault tolerance.
- Manage capacity planning and resource optimization to control cloud costs.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Proficiency in Go, Python, or Ruby for automation and tool development.
- Deep expertise in container orchestration with Kubernetes and Docker.
- Strong background in Linux system internals and networking protocols.
- Proven experience managing large-scale, mission-critical production environments.
- Excellent analytical, problem-solving, and communication skills.