Job Description

Are you obsessed with system uptime and performance at scale? NexusCloud Systems is seeking a Senior Site Reliability Engineer to join our high-impact infrastructure team. We operate a global, multi-region cloud architecture and need a proactive engineer to automate, monitor, and scale our services to meet the demands of millions of users.
You will play a pivotal role in bridging the gap between development and operations, fostering a culture of resilience and continuous improvement.

Responsibilities

Design, build, and maintain highly available, fault-tolerant infrastructure on AWS/GCP.
Develop automation with Terraform, Ansible, and Python to reduce toil.
Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
Optimize cloud costs through intelligent resource management and capacity planning.
Implement and manage observability stacks including Prometheus, Grafana, and Datadog.
Mentor junior engineers on best practices for infrastructure as code (IaC) and system design.
Collaborate with software engineering teams to ensure service scalability and performance.

Qualifications

5+ years of experience in SRE, DevOps, or large-scale systems engineering.
Expert-level proficiency with Docker and container orchestration on Kubernetes.
Deep understanding of CI/CD pipelines (GitHub Actions, Jenkins, or GitLab CI).
Strong scripting and automation skills in Python, Go, or Bash.
Experience with cloud-native monitoring and logging tools.
Solid understanding of networking concepts (DNS, Load Balancing, TLS, VPC).
Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
