Job Description
Are you obsessed with system uptime, performance at scale, and the relentless pursuit of automation? CloudScale Dynamics is looking for a seasoned Senior Site Reliability Engineer to join our core infrastructure team in San Francisco. You will be instrumental in building the next generation of our distributed systems, ensuring our global platform remains resilient, performant, and secure under extreme load.
We value engineers who treat operations as a software engineering problem. If you enjoy building tools that empower developers to ship faster while maintaining ironclad reliability, we want to hear from you.
Responsibilities
- Design and manage highly available, fault-tolerant infrastructure on AWS/GCP.
- Develop and maintain CI/CD pipelines to streamline deployments and increase release velocity.
- Implement observability solutions using Prometheus, Grafana, and the ELK stack.
- Lead incident response efforts and conduct blameless post-mortems.
- Automate manual operational tasks using Python, Go, or Terraform.
- Collaborate with cross-functional engineering teams to optimize cloud spend and resource utilization.
- Maintain system security through automated patching and compliance monitoring.
Qualifications
- 5+ years of experience in SRE, DevOps, or Systems Engineering.
- Expertise in cloud infrastructure (AWS/GCP) and container orchestration (Kubernetes).
- Strong proficiency in Infrastructure as Code (Terraform, Pulumi, or CloudFormation).
- Advanced programming skills in Python, Go, or Bash.
- Deep understanding of Linux internals, networking protocols, and distributed systems.
- Experience with monitoring/alerting ecosystems and incident management tools (PagerDuty).
- Excellent problem-solving skills and a proactive mindset toward system health.