Site Reliability Engineer (SRE)

Position Overview:
We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) with strong expertise in Python, advanced proficiency in Azure-based infrastructure, and significant experience in Customer Reliability Engineering (CRE) and Automation. The ideal candidate will have 3 to 5 years of experience in SRE or related fields and a proven ability to design, deploy, and maintain scalable, reliable, and high-performing cloud solutions. This role focuses on driving system reliability, leveraging automation to optimize operations, and delivering robust solutions for complex infrastructure challenges.

Key Responsibilities:

Design & Plan
- Design and implement comprehensive Elastic (ELK stack) solutions, including Elasticsearch, Logstash, and Kibana.
- Analyze and document requirements to improve existing infrastructure through automation ("Infrastructure as Code") and seamless Azure cloud integration.
- Develop and document architectural designs for scalable Azure solutions, tailored to customer requirements.

Build & Deploy
- Build robust CI/CD pipelines (Azure DevOps, Jenkins, ArgoCD) to support efficient code deployment and reusable automation workflows.
- Advance scripting and automation frameworks using Python, Bash, and Painless scripting languages.
- Manage, troubleshoot, and enhance Kubernetes clusters, including Azure Kubernetes Service (AKS) environments.
- Deploy production-ready Elasticsearch clusters on-premises and in Kubernetes clusters.

Operate & Support
- Proactively monitor systems using tools like Azure Monitor, Elastic Observability, and Application Insights, ensuring high availability and performance.
- Develop self-healing mechanisms and automated scaling for distributed systems to reduce downtime and improve reliability.
- Lead incident response processes, conduct root cause analysis, and drive post-mortem discussions to prevent recurring issues.
- Collaborate with security teams to implement and maintain best practices for system security and compliance.

Automation
- Develop robust automation scripts for repetitive operational workflows, configuration management, and deployment pipelines using tools such as Ansible, Terraform, and Helm.
- Drive enhancements in infrastructure automation to enable seamless deployments and self-service capabilities for engineering teams.

Collaboration & Customer Engagement
- Partner with cross-functional teams (engineering, operations, and product) to design systems with reliability and performance in mind.
- Work closely with customers to address specific reliability challenges and ensure tailored Azure-based solutions meet their operational needs.
- Foster a DevOps culture and champion best practices across teams.

Qualifications:

Experience
- 5+ years of hands-on experience as SRE / SRE Automation Engineer.
- Proven expertise in designing, deploying, and managing Azure cloud infrastructure and services.
- Significant experience in Elastic stack (ELK), including managing Elasticsearch clusters, Logstash pipelines, and Kibana visualizations.
- Advanced proficiency in Python scripting and automation for large-scale systems.
- Strong knowledge of Kubernetes cluster management, including AKS.
- Demonstrated experience building CI/CD pipelines and deploying applications in distributed environments.
- Working knowledge of containerization tools like Docker and orchestration technologies.

Technical Skills
- Azure Expertise: Azure Kubernetes Service (AKS), Azure DevOps, Application Insights, Log Analytics, and Azure security best practices.
- Automation Tools: Proficiency with Ansible, Terraform, Helm, and ArgoCD.
- Scripting: Python (advanced), Bash, Painless scripting for Elasticsearch pipelines.
- Monitoring: Elastic Observability, Grafana, and Azure-native tools.
- Networking: Understanding of virtual networks, firewalls, and RBAC in cloud environments.
- Security: Familiarity with OAuth, SAML, and secure deployment methodologies.
- Knowledge of highly scalable systems, RESTful APIs, and caching mechanisms.

Soft Skills
- Strong problem-solving and troubleshooting skills for complex distributed systems.
- Excellent communication and collaboration skills, including the ability to liaise between technical teams and non-technical stakeholders.
- Customer-focused approach, with a track record of designing solutions that meet client-specific reliability requirements.
- Proactive, self-motivated, and committed to continuous learning and improvement.

Education
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent practical experience).

Preferred Qualifications
- Working knowledge of Elastic Cloud on Kubernetes (ECK).
- Certification in Microsoft Azure or Kubernetes.
- Experience implementing GitOps methodologies for deployment automation.
Job Title
Site Reliability Engineer