Job Description
Are you obsessed with uptime, scalability, and system performance? NexusScale is looking for an elite Senior Site Reliability Engineer to join our core infrastructure team in the heart of San Francisco. You will play a pivotal role in designing, building, and maintaining our mission-critical distributed systems, ensuring our global customer base experiences industry-leading reliability.
You will work at the intersection of software engineering and systems operations, bridging the gap between development and production environments using modern Infrastructure-as-Code (IaC) principles.
Responsibilities
- Architect and maintain highly available, scalable cloud infrastructure on AWS/GCP.
- Drive incident response and post-mortem processes to ensure continuous service improvement.
- Automate manual operational tasks through scripting (Python, Go) and CI/CD pipeline integration.
- Manage production Kubernetes clusters with a focus on optimization and resource efficiency.
- Implement rigorous observability and monitoring strategies using Prometheus, Grafana, and ELK.
- Collaborate with engineering squads to ensure services are designed for reliability from day one.
- Proactively identify and mitigate system bottlenecks through performance tuning and capacity planning.
Qualifications
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Deep proficiency in cloud platforms (AWS preferred, GCP acceptable).
- Expert-level experience with container orchestration using Kubernetes and Docker.
- Strong programming skills in Go, Python, or Ruby.
- Hands-on experience with Infrastructure as Code (Terraform, Pulumi, or Ansible).
- Solid understanding of distributed systems, networking, and security best practices.
- Proven track record of managing large-scale production environments with 99.99% uptime requirements.