Job Description
Are you obsessed with system uptime and performance at scale? NexusCloud Systems is seeking a Senior Site Reliability Engineer to join our high-impact infrastructure team. We operate a global, multi-region cloud architecture and need a proactive engineer to automate, monitor, and scale our services to meet the demands of millions of users.
You will play a pivotal role in bridging the gap between development and operations, fostering a culture of resilience and continuous improvement.
Responsibilities
- Design, build, and maintain highly available, fault-tolerant infrastructure on AWS/GCP.
- Develop automation with Terraform, Ansible, and Python to reduce toil.
- Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
- Optimize cloud costs through intelligent resource management and capacity planning.
- Implement and manage observability stacks including Prometheus, Grafana, and Datadog.
- Mentor junior engineers on best practices for infrastructure as code (IaC) and system design.
- Collaborate with software engineering teams to ensure service scalability and performance.
Qualifications
- 5+ years of experience in SRE, DevOps, or large-scale Systems Engineering.
- Expert-level proficiency with Docker and container orchestration on Kubernetes.
- Deep understanding of CI/CD pipelines (GitHub Actions, Jenkins, or GitLab CI).
- Strong scripting and automation skills in Python, Go, or Bash.
- Experience with cloud-native monitoring and logging tools.
- Solid understanding of networking concepts (DNS, Load Balancing, TLS, VPC).
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.