Job Description
Are you obsessed with system performance, scalability, and uptime? NexusCloud Systems is seeking a highly skilled Senior Site Reliability Engineer to join our core infrastructure team. In this role, you will be the bridge between development and operations, ensuring our global cloud infrastructure remains resilient, performant, and secure. You'll work in a fast-paced environment where your contributions directly affect the experience of millions of users.
Responsibilities
- Architect and maintain highly scalable, distributed cloud infrastructure on AWS/GCP.
- Develop and implement automation scripts using Python, Go, or Bash to reduce manual operational toil.
- Proactively monitor system performance and troubleshoot complex issues in production environments.
- Lead incident response efforts and conduct blameless post-mortems to improve future system reliability.
- Optimize CI/CD pipelines to ensure seamless and reliable software deployments.
- Collaborate with software engineering teams to design resilient application architectures.
- Define and maintain SLOs, SLIs, and error budgets for mission-critical services.
Qualifications
- 5+ years of experience in Site Reliability Engineering or DevOps roles.
- Deep expertise in Docker and container orchestration platforms such as Kubernetes.
- Proficiency in infrastructure-as-code tools such as Terraform or Pulumi.
- Strong experience with monitoring and observability stacks (e.g., Prometheus, Grafana, Datadog).
- Solid understanding of cloud networking, security protocols, and Linux system internals.
- Demonstrated ability to script and automate complex tasks in Python or Go.
- Bachelor's degree in Computer Science or Engineering, or equivalent practical experience.