Job Description
Are you obsessed with uptime, performance, and building resilient distributed systems? NexusScale is looking for a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will bridge the gap between software development and IT operations, ensuring our global cloud infrastructure remains highly available, scalable, and secure. You will work at the intersection of architecture and automation to solve complex challenges at massive scale.
Responsibilities
- Design, implement, and maintain highly available distributed systems on AWS/GCP.
- Drive capacity planning, performance analysis, and optimization of production environments.
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Lead incident response and conduct thorough post-mortem analyses to prevent recurrence.
- Develop and maintain CI/CD pipelines to ensure seamless, reliable code deployments.
- Implement proactive monitoring, logging, and alerting strategies using Prometheus, Grafana, and the ELK Stack.
- Collaborate with engineering teams to promote best practices in observability and service reliability.
Qualifications
- BS/MS in Computer Science, Engineering, or a related technical field.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Proficiency in programming/scripting (Python, Go, or Bash).
- Deep expertise in Kubernetes, Docker, and container orchestration at scale.
- Strong background in Linux system administration and network fundamentals.
- Hands-on experience with cloud-native infrastructure and Infrastructure as Code (IaC).
- Proven ability to troubleshoot and resolve complex production issues under pressure.
- Excellent communication skills with the ability to articulate technical debt to stakeholders.