Job Description
Are you obsessed with system reliability and massive scale? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our high-impact platform engineering team. We are building the next generation of cloud infrastructure to support millions of concurrent users. In this role, you will bridge the gap between development and operations, ensuring our services are resilient, performant, and secure.
You will work with a world-class team of engineers to automate the boring stuff, eliminate toil, and define the future of our production environment.
Responsibilities
- Architect and maintain highly available, scalable cloud infrastructure on AWS/GCP.
- Drive incident response and post-mortem analysis to prevent recurrence of system outages.
- Develop and maintain CI/CD pipelines to streamline deployment workflows.
- Implement Infrastructure as Code (IaC) using Terraform and Pulumi.
- Monitor system performance, define SLOs/SLIs, and optimize resource utilization.
- Collaborate with engineering teams to improve software observability and logging standards.
- Mentor junior engineers on best practices for distributed systems.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Deep proficiency in Kubernetes, Docker, and container orchestration at scale.
- Expertise in at least one programming language: Python, Go, or Ruby.
- Hands-on experience with cloud infrastructure (AWS preferred) and IaC tools.
- Strong understanding of Linux internals, networking, and distributed systems.
- Experience with monitoring tools like Prometheus, Grafana, or Datadog.
- Excellent communication skills and the ability to thrive in a high-pressure environment with on-call duties.