Job Description
Are you obsessed with system uptime, performance at scale, and automating the mundane? Nexus Cloud Infrastructure is looking for a world-class Senior Site Reliability Engineer to join our high-impact team. We are pushing the boundaries of distributed systems and cloud-native architecture, and we need an expert to help us build, secure, and maintain the platforms that power our global services.
You will play a critical role in bridging the gap between software development and IT operations, ensuring our services are resilient, observable, and lightning-fast.
Responsibilities
- Architect and maintain highly scalable, distributed cloud infrastructure on AWS/GCP.
- Drive automated provisioning and configuration management using Terraform and Ansible.
- Champion SRE best practices, including error budgets, SLIs, SLOs, and incident response.
- Lead post-mortem analysis and implement long-term engineering solutions to prevent recurring issues.
- Develop internal tooling to improve developer productivity and deployment velocity.
- Participate in an on-call rotation that supports our 99.99% system availability target.
- Collaborate with cross-functional engineering teams to optimize system performance and latency.
Qualifications
- 5+ years of experience in SRE, DevOps, or Software Engineering roles.
- Expert-level proficiency in cloud infrastructure (AWS or GCP) and Kubernetes orchestration.
- Strong coding skills in Go, Python, or Ruby for automation and tool development.
- Deep understanding of Linux internals, networking, and distributed system architectures.
- Experience with monitoring and observability stacks like Prometheus, Grafana, or ELK.
- Proven ability to troubleshoot complex production issues in high-traffic environments.
- Excellent communication skills and a passion for mentoring junior team members.