Job Description
Are you obsessed with system reliability, performance, and automation? NexusCloud Systems is looking for a Senior Site Reliability Engineer to help us build and maintain our next-generation cloud infrastructure. You will be the bridge between development and operations, ensuring our high-scale services remain resilient, secure, and performant.
We operate at a massive scale and believe in 'everything as code.' If you are passionate about reducing toil and optimizing system performance, we want to talk to you.
Responsibilities
- Design, build, and maintain scalable infrastructure on AWS and Kubernetes.
- Automate manual operational processes to eliminate toil and improve system efficiency.
- Lead incident response, root cause analysis, and post-mortem investigations for production systems.
- Define and implement Service Level Objectives (SLOs) and Error Budgets.
- Collaborate with software engineering teams to improve application performance and reliability.
- Manage CI/CD pipelines to ensure seamless, secure, and rapid deployment of services.
- Mentor junior engineers and promote a culture of operational excellence.
Qualifications
- Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience.
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Proficiency in programming languages such as Go, Python, or Java.
- Expertise in container orchestration with Kubernetes, including packaging and deploying applications with Helm.
- Deep understanding of cloud infrastructure (AWS/GCP) and Infrastructure as Code (Terraform).
- Solid experience with monitoring and logging stacks like Prometheus, Grafana, and ELK.
- Strong problem-solving skills and the ability to work under pressure during high-impact outages.