Job Description

Are you an expert in architecting highly scalable, resilient infrastructure? NexusCloud Systems is seeking a Senior Site Reliability Engineer to join our core engineering team in San Francisco. You will play a pivotal role in ensuring the availability, latency, performance, and efficiency of our global cloud platform.
We are looking for a forward-thinking engineer who views operations as a software engineering problem. You will work alongside our Product and Infrastructure teams to build the next generation of our automated deployment pipelines.

Responsibilities

Design and maintain robust, scalable infrastructure on AWS and Kubernetes.
Automate operational tasks using Go, Python, or Terraform to reduce toil.
Implement comprehensive monitoring, alerting, and observability solutions using Prometheus and Grafana.
Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
Optimize cloud costs and infrastructure performance through deep analysis and architectural improvements.
Collaborate with development teams to integrate CI/CD best practices and shift-left security.
Mentor junior engineers and promote a culture of operational excellence.

Qualifications

5+ years of experience in SRE, DevOps, or Systems Engineering roles.
Deep proficiency in Kubernetes orchestration and containerization (Docker).
Expert-level knowledge of the AWS ecosystem (EC2, EKS, RDS, Lambda).
Strong programming skills in Go, Python, or Ruby.
Hands-on experience with Infrastructure as Code (Terraform, CloudFormation).
Deep understanding of distributed systems, networking protocols, and Linux internals.
Proven ability to troubleshoot complex performance bottlenecks in a high-traffic production environment.
