Job Description
Are you obsessed with system uptime, latency, and building highly scalable distributed systems? Nexus Cloud Infrastructure is seeking a Senior Site Reliability Engineer to join our core platform team in San Francisco. You will be the bridge between development and operations, ensuring our global cloud infrastructure remains resilient, performant, and secure.
We operate at massive scale and value engineers who think in terms of automation, observability, and systematic problem-solving. If you are passionate about pushing the boundaries of what is possible in cloud-native environments, we want to meet you.
Responsibilities
- Design and maintain highly available, fault-tolerant infrastructure on AWS and Kubernetes.
- Automate operational tasks using Go, Python, or Terraform to reduce manual toil.
- Lead incident response, root cause analysis, and post-mortem discussions for production outages.
- Improve system observability through advanced logging, distributed tracing, and real-time monitoring.
- Develop and manage CI/CD pipelines to streamline code deployment and infrastructure provisioning.
- Collaborate with software engineering teams to optimize application performance and architecture.
- Define and implement Service Level Objectives (SLOs) and Error Budgets to balance reliability with velocity.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Expert-level proficiency with AWS, Docker, and container orchestration on Kubernetes.
- Deep understanding of IaC tools like Terraform or CloudFormation.
- Strong coding skills in Python, Go, or Ruby for automation and tool development.
- Experience with monitoring stacks such as Prometheus, Grafana, Datadog, or ELK.
- Deep knowledge of networking protocols (TCP/IP, HTTP/HTTPS, DNS) and load-balancing strategies.
- Excellent communication skills with the ability to lead cross-functional technical initiatives.