Job Description
Are you obsessed with system uptime and performance at scale? Nexus Cloud Infrastructure is looking for a Senior Site Reliability Engineer to join our core engineering team in San Francisco. You will be the architect behind our global cloud footprint, ensuring our services are resilient, observable, and lightning-fast.
We operate a massive Kubernetes environment and believe in eliminating toil through relentless automation and robust infrastructure-as-code practices.
Responsibilities
- Design, build, and maintain highly available, scalable, and secure cloud infrastructure.
- Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
- Implement advanced monitoring, alerting, and observability strategies using Prometheus, Grafana, and ELK.
- Automate manual operational tasks using Go, Python, or Terraform to reduce toil.
- Collaborate with development teams to optimize application performance and deployment velocity.
- Manage capacity planning and resource optimization to ensure cost-efficiency.
- Drive the adoption of Site Reliability Engineering best practices across the engineering organization.
Qualifications
- 5+ years of experience in SRE, DevOps, or large-scale Systems Engineering roles.
- Deep expertise in Kubernetes orchestration and containerization (Docker).
- Proficiency in at least one infrastructure-as-code tool, preferably Terraform.
- Strong programming skills in Go, Python, or Ruby for automation and tooling development.
- Experience managing infrastructure on a major cloud provider (AWS, GCP, or Azure) at scale.
- Solid understanding of distributed systems, networking (TCP/IP, HTTP/S, DNS), and Linux internals.
- Demonstrated ability to communicate complex technical concepts to cross-functional stakeholders.