Job Description
Are you obsessed with system reliability, performance, and scalability? NexusCloud Systems is seeking a high-impact Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will be the bridge between development and operations, ensuring our high-traffic global platforms remain resilient, performant, and secure.
You will have the autonomy to architect complex solutions, lead incident response, and drive a culture of 'automation-first' engineering.
Responsibilities
- Design, build, and maintain highly scalable, distributed systems on AWS/GCP.
- Drive capacity planning and performance tuning to optimize infrastructure costs.
- Implement and manage CI/CD pipelines to increase deployment velocity.
- Lead post-mortem analysis and incident response for critical service outages.
- Develop automated tooling and monitoring solutions to improve system observability.
- Collaborate with cross-functional teams to define and enforce SLOs/SLIs.
- Mentor junior engineers and advocate for SRE best practices throughout the SDLC.
Qualifications
- 5+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering.
- Expert-level proficiency in Go, Python, or Java.
- Deep expertise with Kubernetes orchestration and containerization (Docker).
- Proven experience managing Infrastructure as Code (Terraform, Pulumi, or CloudFormation).
- Deep understanding of Linux internals, networking, and distributed system design.
- Experience with monitoring tools such as Prometheus, Grafana, or Datadog.
- Strong problem-solving skills and the ability to thrive in a fast-paced environment.