Job Description
At NexusCloud, we are building the backbone of next-generation distributed systems. We are looking for a visionary Senior Site Reliability Engineer to join our high-impact SRE team in San Francisco. You will be responsible for the scalability, reliability, and automation of our global cloud infrastructure, ensuring our platform maintains 99.99% availability for millions of users worldwide.
You thrive on solving complex distributed systems problems and are passionate about 'infrastructure as code'. If you want to work at the intersection of software engineering and systems operations, we want to hear from you.
Responsibilities
- Design, build, and maintain highly scalable and fault-tolerant distributed systems.
- Automate infrastructure provisioning and configuration management using Terraform and Ansible.
- Drive capacity planning, performance tuning, and operational efficiency across the production environment.
- Lead incident response efforts and conduct blameless post-mortems to improve system resilience.
- Implement advanced monitoring, logging, and alerting solutions to ensure deep system observability.
- Collaborate with cross-functional software engineering teams to advocate for SRE best practices.
- Mentor junior engineers and contribute to an inclusive, high-growth technical culture.
Qualifications
- Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or large-scale systems engineering.
- Deep expertise in public cloud platforms (AWS, GCP, or Azure).
- Strong programming proficiency in at least one language: Go, Python, or Java.
- Advanced knowledge of Kubernetes, container orchestration, and microservices architecture.
- Experience with CI/CD pipelines and modern GitOps workflows.
- Excellent analytical, problem-solving, and communication skills in a fast-paced environment.