Job Description
Are you obsessed with system performance and uptime? NexusCloud Systems is seeking a highly skilled Senior Site Reliability Engineer to join our infrastructure team in San Francisco. You will play a pivotal role in designing, building, and maintaining our global cloud architecture, ensuring high availability and scalability for millions of concurrent users.
We operate at the intersection of software engineering and systems operations, using code to solve complex infrastructure problems. If you are a proponent of 'Infrastructure as Code' and thrive in a fast-paced environment, we want to hear from you.
Responsibilities
- Design, build, and maintain highly scalable, distributed production systems.
- Automate manual operational tasks using Python, Go, or Bash.
- Drive capacity planning, performance tuning, and system optimization.
- Lead incident response, root cause analysis, and post-mortem investigations.
- Implement proactive monitoring, alerting, and observability solutions.
- Collaborate with development teams to integrate CI/CD best practices.
- Manage cloud infrastructure (AWS/GCP) using Terraform or similar IaC tools.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Proficiency in cloud infrastructure management (AWS preferred).
- Strong expertise in Linux systems administration and network protocols.
- Hands-on experience with container technologies such as Docker and container orchestration with Kubernetes.
- Deep understanding of Infrastructure as Code (Terraform, CloudFormation).
- Experience with observability stacks (Prometheus, Grafana, Datadog, ELK).
- Strong proficiency in at least one scripting or programming language (Go, Python, or Ruby).
- Bachelor’s degree in Computer Science or equivalent practical experience.