Job Description
Are you an expert in architecting highly scalable, resilient infrastructure? NexusCloud Systems is seeking a Senior Site Reliability Engineer to join our core engineering team in San Francisco. You will play a pivotal role in ensuring the availability, latency, performance, and efficiency of our global cloud platform.
We are looking for a forward-thinking engineer who views operations as a software engineering problem. You will work alongside our Product and Infrastructure teams to build the next generation of our automated deployment pipelines.
Responsibilities
- Design and maintain robust, scalable infrastructure on AWS and Kubernetes.
- Automate operational tasks using Go, Python, or Terraform to reduce toil.
- Implement comprehensive monitoring, alerting, and observability solutions using Prometheus and Grafana.
- Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
- Optimize cloud costs and infrastructure performance through deep analysis and architectural improvements.
- Collaborate with development teams to integrate CI/CD best practices and shift-left security.
- Mentor junior engineers and promote a culture of operational excellence.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Deep proficiency in Kubernetes orchestration and containerization (Docker).
- Expert-level knowledge of the AWS ecosystem (EC2, EKS, RDS, Lambda).
- Strong programming skills in Go, Python, or Ruby.
- Hands-on experience with Infrastructure as Code (Terraform, CloudFormation).
- Deep understanding of distributed systems, networking protocols, and Linux internals.
- Proven ability to troubleshoot complex performance bottlenecks in a high-traffic production environment.