Job Description

Are you obsessed with system stability, scalability, and performance? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our core infrastructure team in San Francisco. We operate at massive scale, and we need your expertise to optimize our cloud architecture, automate incident response, and ensure 99.999% uptime for our global user base.
You will work at the intersection of software engineering and systems operations, building the tools that empower our developers to ship code faster and more safely.

Responsibilities

Architect and maintain highly available, scalable, and secure cloud infrastructure.
Automate manual operational tasks using Python, Go, or Terraform.
Lead incident response and perform deep-dive post-mortems to prevent recurrence.
Design and implement observability stacks (Prometheus, Grafana, ELK) to monitor system health.
Collaborate with engineering teams to optimize application performance and resource utilization.
Establish CI/CD pipelines to streamline deployments and increase release velocity.
Mentor junior engineers and foster a culture of reliability throughout the engineering organization.

Qualifications

5+ years of experience in SRE, DevOps, or Systems Engineering roles.
Expert-level proficiency with AWS, GCP, or Azure.
Deep understanding of container orchestration using Kubernetes.
Strong programming skills in Python, Go, or Bash.
Experience with Infrastructure as Code (Terraform, CloudFormation, Ansible).
Solid understanding of networking protocols and concepts (TCP/IP, DNS, load balancing, SSL/TLS).
Proven ability to troubleshoot complex, distributed systems in a high-pressure environment.
