Job Description
Build the future of cloud resilience.
At NexusCloud, we are obsessed with uptime, performance, and scalability. We are seeking a highly skilled Senior Site Reliability Engineer to join our mission-critical infrastructure team. In this role, you will bridge the gap between software development and IT operations, ensuring our global services remain resilient under extreme load. You will design automated solutions, optimize cloud infrastructure, and act as a primary responder for critical system incidents.
Responsibilities
- Design, build, and maintain scalable, reliable, and secure cloud infrastructure on AWS/GCP.
- Implement Infrastructure as Code (IaC) using Terraform or Pulumi to ensure environment consistency.
- Develop and automate CI/CD pipelines to streamline deployment cycles and reduce manual toil.
- Lead post-mortem analyses and implement long-term fixes to prevent recurring system failures.
- Collaborate with engineering teams to optimize application performance and latency.
- Manage capacity planning and resource allocation to ensure cost-efficiency and performance.
- Participate in an on-call rotation to support critical service availability.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering roles.
- Expertise in AWS or GCP cloud environments and managed services.
- Strong proficiency in programming or scripting (Go, Python, or Ruby).
- Hands-on experience with Docker and with container orchestration at scale using Kubernetes.
- Advanced knowledge of monitoring and observability tools (Prometheus, Grafana, Datadog).
- Proven track record of managing large-scale distributed systems.
- Deep understanding of Linux internals, networking, and security best practices.