So, what’s the role all about?

The Site Reliability Engineer works as a software developer focused on reliability for a specific software application or suite of applications and the accompanying infrastructure. This includes implementing new systems, providing mid-level and escalation support for other groups, and working to resolve production issues in conjunction with development, operational, and architectural resources.

How will you make an impact?

- Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
- Partner with development teams to improve services through rigorous testing and release procedures
- Participate in system design consulting, platform management, and capacity planning
- Create sustainable systems and services through automation and uplifts
- Balance feature development speed and reliability with well-defined service level objectives

Have you got what it takes?

- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 8-10 years of working experience in a similar role, with a focus on systems engineering, automation, and reliability.
- Proficiency in at least one programming language (e.g., Python, Go, Java, C#) and experience with scripting languages (e.g., Bash, PowerShell).
- Deep understanding of cloud computing platforms (e.g., AWS) and of the workings and reliability constraints of some of their prominent services (e.g., EC2, ECS, Lambda, DynamoDB).
- Experience with infrastructure as code tools such as CloudFormation and Terraform.
- Deep understanding of CI/CD concepts and experience with CI/CD tools such as Jenkins, GitLab CI/CD, or CircleCI.
- Strong knowledge of containerization technologies (e.g., Docker, Kubernetes) and microservices architecture.
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack, CloudWatch).
- Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
- Experience with incident management and blameless postmortems, including driving incident response during outages and other critical incidents, resolution, and communication in a cross-functional team setup.
- Strong communication skills and the ability to collaborate effectively with cross-functional teams.
- Team player: ability to work well in a close team environment.
- Fast learner with the ability to educate themselves on relevant technologies
- Ability to multitask and prioritize work
- Ability to remain focused and calm under pressure

Good to have skills:

- Hands-on experience working with large Kubernetes clusters. Certification will be an added plus.
- Working experience with the Grafana observability suite (Loki, Mimir, Tempo).
- Administration and/or development experience with standard monitoring and automation tools such as Splunk, Datadog, PagerDuty, and Rundeck.
- Familiarity with configuration management tools like Ansible, Puppet, or Chef.
- Certifications such as AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer, or equivalent.

Job Title
Site Reliability Engineer