Job Description
Are you obsessed with system stability, scalability, and performance? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our core infrastructure team in San Francisco. We operate at massive scale, and we need your expertise to optimize our cloud architecture, automate incident response, and ensure 99.999% uptime for our global user base.
You will work at the intersection of software engineering and systems operations, building the tools that empower our developers to ship code faster and more safely.
Responsibilities
- Architect and maintain highly available, scalable, and secure cloud infrastructure.
- Automate manual operational tasks using Python, Go, or Terraform.
- Lead incident response and perform deep-dive post-mortems to prevent recurrence.
- Design and implement observability stacks (Prometheus, Grafana, ELK) to monitor system health.
- Collaborate with engineering teams to optimize application performance and resource utilization.
- Establish CI/CD pipelines to streamline deployments and increase release velocity.
- Mentor junior engineers and foster a culture of reliability throughout the engineering organization.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Expert-level proficiency with AWS, GCP, or Azure.
- Deep understanding of container orchestration using Kubernetes.
- Strong programming skills in Python, Go, or Bash.
- Experience with Infrastructure as Code (Terraform, CloudFormation, Ansible).
- Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, SSL/TLS).
- Proven ability to troubleshoot complex, distributed systems in a high-pressure environment.