Job Description
Are you obsessed with uptime, scalability, and system performance? Nexus Cloud Infrastructure is looking for an elite Senior Site Reliability Engineer to join our core platform team in San Francisco. You will play a pivotal role in designing, building, and maintaining the systems that power our global, high-traffic SaaS ecosystem. In this role, you will bridge the gap between development and operations, ensuring our services are resilient, efficient, and capable of handling millions of concurrent requests.
Responsibilities
- Architect and maintain high-availability cloud infrastructure on AWS/GCP.
- Automate manual operational processes using Infrastructure as Code (Terraform, Ansible).
- Lead incident response and perform deep-dive post-mortems to identify root causes.
- Optimize cloud resource utilization to balance cost-efficiency with high performance.
- Develop and maintain comprehensive monitoring and alerting strategies using Prometheus and Grafana.
- Collaborate with engineering teams to integrate reliability best practices into the CI/CD pipeline.
- Participate in an on-call rotation to ensure 99.99% system availability.
Qualifications
- 5+ years of experience in SRE, DevOps, or systems engineering roles.
- Advanced proficiency in container orchestration using Kubernetes.
- Expert-level experience with public cloud platforms (AWS or GCP).
- Strong scripting and automation skills in Python, Go, or Ruby.
- In-depth knowledge of distributed systems and microservices architecture.
- Proven experience with database scaling (PostgreSQL, Redis, or other NoSQL stores).
- Experience with observability tools (Datadog, New Relic, or ELK Stack).
- BS/MS in Computer Science, Engineering, or equivalent practical experience.