Job Description
At NexusScale, we are architecting the next generation of resilient cloud infrastructure. We are looking for a Senior Site Reliability Engineer to join our mission-critical platform team. You will be instrumental in bridging the gap between software development and IT operations, ensuring our global services remain performant, scalable, and secure.
This role offers the opportunity to tackle complex distributed systems challenges in a fast-paced, innovation-first environment. If you are passionate about automation, observability, and infrastructure-as-code, we want to hear from you.
Responsibilities
- Design, build, and maintain scalable infrastructure using Terraform and Kubernetes.
- Automate operational tasks to minimize manual intervention and reduce toil.
- Proactively monitor system performance and troubleshoot complex production incidents.
- Define and implement Service Level Objectives (SLOs) and error budgets.
- Participate in a collaborative on-call rotation to help maintain our 99.99% system availability target.
- Lead post-mortem analysis sessions to drive continuous improvement in system architecture.
- Collaborate with cross-functional engineering teams to integrate security best practices into the CI/CD pipeline.
Qualifications
- Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Deep expertise in public cloud platforms (AWS, GCP, or Azure).
- Proven proficiency in container orchestration (Kubernetes) and service meshes.
- Strong programming skills in Python, Go, or Ruby for automation scripting.
- In-depth understanding of Linux internals, networking, and security protocols.
- Experience managing large-scale distributed databases and caching layers.