Job Description
Are you obsessed with uptime, scalability, and performance? NexusScale is looking for a Senior Site Reliability Engineer to join our core infrastructure team in San Francisco. You will be the bridge between development and operations, ensuring our cloud-native platforms operate with elite-level precision and resilience.
We don't just 'keep the lights on'; we engineer high-availability systems that handle millions of requests daily. If you thrive on solving complex distributed systems puzzles, this is your next career destination.
Responsibilities
- Design, implement, and maintain highly available and scalable cloud infrastructure using Terraform and Kubernetes.
- Automate manual operational tasks to minimize 'toil' and improve system reliability.
- Lead incident response efforts, conduct blameless post-mortems, and identify the root causes of system outages.
- Optimize system performance, latency, and resource utilization across our microservices architecture.
- Mentor junior engineers and champion SRE best practices across the engineering organization.
- Implement robust monitoring, logging, and alerting solutions using Prometheus, Grafana, and ELK.
- Collaborate with product teams to plan for capacity requirements and service growth.
Qualifications
- 5+ years of experience in SRE, DevOps, or Software Engineering with a focus on infrastructure.
- Deep proficiency in cloud platforms (AWS or GCP) and container orchestration with Kubernetes.
- Strong coding skills in Go, Python, or Ruby for automation and tool development.
- Proven expertise in Infrastructure as Code (IaC) and Kubernetes packaging tools, specifically Terraform and Helm.
- Solid understanding of CI/CD pipelines (Jenkins, GitHub Actions, or GitLab CI).
- Experience managing high-traffic, distributed databases (PostgreSQL, Cassandra, or Redis).
- Excellent communication skills with the ability to articulate technical debt to non-technical stakeholders.