Job Description
Are you obsessed with system reliability, performance optimization, and building robust, scalable cloud infrastructure? Nexus Cloud Systems is seeking a Senior Site Reliability Engineer to join our high-impact SRE team in the heart of San Francisco. In this role, you will bridge the gap between development and operations, ensuring our global platforms remain resilient, secure, and performant at scale.
You will leverage your deep expertise in distributed systems and automation to minimize toil, drive incident response, and shape our long-term infrastructure roadmap.
Responsibilities
- Design, implement, and maintain highly available, distributed cloud infrastructure on AWS/GCP.
- Drive capacity planning, performance tuning, and cost-optimization initiatives.
- Lead incident response and post-mortem analysis for critical production issues.
- Develop and maintain CI/CD pipelines to facilitate seamless, automated deployments.
- Implement Infrastructure as Code (IaC) best practices using Terraform or Pulumi.
- Collaborate with cross-functional engineering teams to embed reliability standards into the SDLC.
- Mentor junior engineers and champion a blameless culture of operational excellence.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or large-scale Systems Engineering roles.
- Advanced proficiency in programming with Go, Python, or Ruby.
- Deep understanding of container orchestration platforms, specifically Kubernetes.
- Expert-level experience managing infrastructure on a major cloud provider (AWS, GCP, or Azure).
- Proven track record with observability tools such as Prometheus, Grafana, Datadog, or Honeycomb.
- Strong background in Linux internals, networking protocols, and security principles.