Job Description
Are you obsessed with uptime, scalability, and system performance? Nexus Cloud Infrastructure is looking for a Senior Site Reliability Engineer to join our core platform team. We manage high-traffic distributed systems and need an expert to bridge the gap between software development and IT operations. You will play a pivotal role in designing robust infrastructure that powers mission-critical applications for our global clients.
We value engineers who automate everything, treat infrastructure as code, and thrive in complex, high-pressure environments.
Responsibilities
- Design, implement, and maintain highly available, scalable, and secure cloud infrastructure on AWS/GCP.
- Drive capacity planning, performance tuning, and cost optimization initiatives across our microservices architecture.
- Lead incident response and root-cause analysis efforts for production outages, implementing long-term preventative measures.
- Automate manual operational workflows using Python, Go, or shell scripting.
- Develop and manage CI/CD pipelines to ensure rapid, reliable code deployments.
- Define SLIs, set and enforce SLOs, and uphold SLAs to maintain system reliability standards.
- Mentor junior engineers and promote a culture of operational excellence across the engineering department.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Deep proficiency in cloud platforms (AWS or GCP) and container orchestration tools like Kubernetes.
- Strong background in Linux system administration and networking (TCP/IP, DNS, Load Balancing).
- Experience with Infrastructure as Code (Terraform, CloudFormation, or Ansible).
- Expertise in observability and monitoring tools such as Prometheus, Grafana, Datadog, or the ELK Stack.
- Proven ability to troubleshoot complex distributed systems in a production environment.
- Proficiency in at least one high-level programming language such as Go, Python, or Ruby.