Job Description
Are you obsessed with uptime, scalability, and system performance? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will be the architect of our reliability strategy, bridging the gap between development and operations to ensure our global platform delivers seamless experiences to millions of users.
You will work in a high-impact environment where engineering rigor, automation, and proactive problem-solving are central to our success.
Responsibilities
- Design and maintain highly available, scalable infrastructure on AWS and Kubernetes.
- Lead incident response efforts and conduct blameless post-mortems to improve system resilience.
- Automate manual operational tasks through Infrastructure as Code (Terraform/Pulumi) to minimize toil.
- Define and implement Service Level Objectives (SLOs) and Error Budgets for critical microservices.
- Collaborate with cross-functional teams to improve CI/CD pipelines and deployment velocity.
- Proactively monitor system performance and plan capacity to prevent bottlenecks.
- Mentor junior SREs and promote a culture of reliability engineering best practices.
Qualifications
- 5+ years of experience in SRE, DevOps, or large-scale Systems Engineering roles.
- Deep expertise in Kubernetes (K8s) cluster management and container orchestration.
- Proficiency in at least one modern programming language: Go, Python, or Java.
- Advanced experience with cloud infrastructure (AWS/GCP/Azure) and IaC tools (Terraform).
- Solid understanding of observability tools like Prometheus, Grafana, Datadog, or Honeycomb.
- Strong grasp of Linux internals, networking protocols (TCP/IP, DNS, HTTP), and security.
- Excellent communication skills with the ability to explain complex technical concepts to non-technical stakeholders.