Job Description
Are you obsessed with system stability, scalability, and performance? NexusCloud Systems is looking for a Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will be the bridge between development and operations, ensuring our high-traffic global platform remains resilient and performant.
We are building the next generation of cloud-native infrastructure, and we need an expert who thrives on automation, chaos engineering, and proactive incident management. The position is based in the heart of San Francisco.
Responsibilities
- Design, build, and maintain highly available, distributed systems in a multi-region cloud environment.
- Lead incident response efforts and conduct blameless post-mortems to improve system reliability.
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Optimize cloud costs and performance through capacity planning and resource tuning.
- Develop and manage SLOs, SLIs, and comprehensive monitoring/alerting dashboards.
- Mentor junior engineers and promote a culture of operational excellence and best practices.
- Implement CI/CD pipelines to increase deployment velocity and reduce release risk.
Qualifications
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Expertise in AWS or GCP cloud architecture and managed services.
- Strong proficiency in Infrastructure as Code (Terraform) and configuration tools (Ansible/Chef).
- Hands-on experience with Kubernetes, containerization, and service meshes (Istio/Linkerd).
- Proficiency in at least one programming language such as Go, Python, or Ruby for automation.
- Deep understanding of observability tools like Prometheus, Grafana, Datadog, or New Relic.
- Experience managing high-volume SQL and NoSQL databases (e.g., PostgreSQL, MongoDB, Cassandra).