Job Description
Are you obsessed with uptime, scalability, and system performance? NexusCloud is seeking a high-impact Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between software development and IT operations, ensuring our global cloud architecture remains resilient and highly performant under heavy load.
You will work alongside elite engineers to automate provisioning, optimize latency, and drive our incident management strategy. If you thrive in a fast-paced environment and love solving complex distributed systems puzzles, we want to hear from you.
Responsibilities
- Design and implement robust monitoring, alerting, and logging systems to ensure 99.99% service availability.
- Lead the automation of infrastructure provisioning and configuration management using Terraform and Ansible.
- Conduct deep-dive post-mortems and root cause analysis for production incidents to prevent recurrence.
- Develop and maintain CI/CD pipelines to improve deployment velocity and reliability.
- Optimize cloud resource utilization to balance performance needs with cost-efficiency.
- Collaborate with cross-functional product teams to define and meet rigorous Service Level Objectives (SLOs).
- Mentor junior team members on SRE best practices and operational excellence.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or large-scale systems engineering.
- Expert-level proficiency in public cloud environments (AWS, GCP, or Azure).
- Strong hands-on experience running Docker containers and Kubernetes orchestration at scale.
- Advanced scripting skills in Python, Go, or Ruby for automation and tool development.
- Deep understanding of networking protocols, load balancing, and distributed systems architecture.
- Proven ability to thrive in an on-call rotation and handle incident response effectively.