Job Description
Are you obsessed with uptime, scalability, and system performance? Nexus Cloud Infrastructure is looking for an elite Senior Site Reliability Engineer to join our core platform team. In this role, you will bridge the gap between software engineering and systems operations, ensuring our mission-critical services are robust, observable, and lightning-fast.
We operate at a massive scale, processing billions of requests daily. You will be instrumental in designing our next-gen infrastructure automation and optimizing our Kubernetes clusters for maximum efficiency.
Responsibilities
- Architect and maintain highly available distributed systems on GCP/AWS.
- Design and implement infrastructure-as-code (IaC) automation for infrastructure provisioning using Terraform.
- Lead incident response efforts and conduct blameless post-mortems to improve system resilience.
- Define and implement SLOs, SLIs, and error budgets for production services.
- Collaborate with development teams to integrate CI/CD best practices and improve deployment pipelines.
- Optimize system performance, latency, and resource utilization across the stack.
- Mentor junior engineers and advocate for SRE best practices across the organization.
Qualifications
- 5+ years of experience in Site Reliability Engineering or DevOps roles in a scale-up environment.
- Deep expertise in Kubernetes, Docker, and container orchestration at scale.
- Proficiency in Go, Python, or Ruby for building automation tools.
- Strong background in cloud infrastructure (AWS or GCP) and networking fundamentals (TCP/IP, DNS, load balancing).
- Hands-on experience with observability tools such as Prometheus, Grafana, Datadog, or Honeycomb.
- Solid understanding of CI/CD concepts and tooling (GitHub Actions, ArgoCD, or Jenkins).
- Strong communication skills and the ability to thrive in a collaborative, remote-friendly team environment.