Job Description
Elevate the Future of Cloud Infrastructure
NexusCloud Systems is seeking a Senior Site Reliability Engineer to join our SRE team in San Francisco. You will design, build, and scale our global cloud infrastructure, ensuring 99.999% availability for our mission-critical enterprise platforms. We are looking for an expert in distributed systems who thrives on automating away manual toil and building resilience into complex architectures.
Responsibilities
- Design and maintain highly available, scalable, and resilient distributed systems on AWS and GCP.
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Champion observability practices by implementing advanced monitoring, logging, and tracing solutions (Datadog, Prometheus).
- Lead incident response, perform blameless post-mortems, and drive architectural improvements to prevent recurrence.
- Collaborate with development teams to integrate CI/CD pipelines and ensure seamless deployment cycles.
- Operate and maintain Kubernetes clusters at scale to support a microservices architecture.
- Define and track Service Level Objectives (SLOs) and error budgets to balance feature velocity with system stability.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Deep expertise in Docker, Kubernetes, and other container orchestration platforms.
- Proficiency in Go, Python, or Ruby for infrastructure automation.
- Hands-on experience with IaC tools such as Terraform or CloudFormation.
- Strong background in Linux internals, networking (TCP/IP, DNS, load balancing), and security best practices.
- Proven ability to troubleshoot complex issues across the entire stack in high-pressure environments.