Job Description
Are you obsessed with uptime, scalability, and building robust distributed systems? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our high-impact team in San Francisco. You will be the architect of our platform's resilience, working at the intersection of software development and systems engineering to ensure our cloud infrastructure thrives under extreme load.
You will play a critical role in shaping our CI/CD pipelines, automating operational toil, and fostering a culture of blameless post-mortems and proactive observability.
Responsibilities
- Architect and maintain highly available, scalable cloud infrastructure on AWS and Kubernetes.
- Automate manual operational tasks using Go, Python, or Terraform to improve system reliability.
- Lead incident response efforts and conduct deep-dive post-mortems to prevent recurrence.
- Optimize cloud resource utilization and cost-efficiency without compromising performance.
- Collaborate with engineering teams to integrate observability best practices into the SDLC.
- Design and implement disaster recovery protocols and failover testing strategies.
- Mentor junior engineers and champion SRE best practices across the engineering organization.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Advanced proficiency with Kubernetes, containerization, and service mesh technologies.
- Expertise in infrastructure-as-code tools such as Terraform, Pulumi, or Crossplane.
- Strong proficiency in at least one modern language: Go, Python, or Java.
- Deep understanding of cloud-native observability stacks (Prometheus, Grafana, ELK, or Datadog).
- Proven experience managing high-traffic distributed systems in a production environment.
- Excellent communication skills with the ability to bridge gaps between ops and product teams.