Job Description
Are you obsessed with system uptime, performance at scale, and automating the mundane? NexusCloud Systems is seeking a visionary Senior Site Reliability Engineer to join our core infrastructure team in San Francisco. You will play a pivotal role in designing, building, and maintaining the highly available cloud-native environments that power our global SaaS platform.
We don't just 'keep the lights on'; we engineer solutions that prevent outages before they happen. Join a culture of SRE excellence where innovation is encouraged and impact is visible.
Responsibilities
- Architect, implement, and optimize scalable cloud infrastructure on AWS and Kubernetes.
- Automate operational tasks using Infrastructure as Code (Terraform) and CI/CD pipelines.
- Drive incident management and post-mortem analysis to enhance system resilience.
- Implement proactive monitoring, logging, and alerting strategies to sustain a 99.99% uptime target.
- Lead capacity planning and performance tuning to support rapid traffic growth.
- Mentor junior engineers and promote a culture of operational excellence across teams.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Expert-level proficiency with AWS and container orchestration on Kubernetes (EKS).
- Strong development skills in Go, Python, or Ruby for automation.
- Deep understanding of observability tools like Prometheus, Grafana, and Datadog.
- Proven ability to troubleshoot complex, distributed systems in a production environment.
- Experience managing large-scale PostgreSQL or NoSQL database clusters.