JOB SUMMARY:
The role is responsible for overseeing system performance and uptime, managing IT digital operations, and maintaining and improving the operational efficiency of systems, with an emphasis on deployment automation, system optimization, and consistent performance and reliability. The candidate should possess strong technical problem-solving skills and a desire to implement scalable and sustainable technological solutions.

Key Responsibilities, Relationships & Measures of Success:
Provide strategic direction for technologies and solutions in digital operations.
Lead infrastructure and application builds and technical maintenance alongside core engineering and delivery teams.
Identify areas for service improvement by analyzing and diagnosing recurring platform and service incidents, as well as customer and stakeholder feedback.
Serve as the custodian of SRE SLOs, SLIs, and error budgets.
Assist in designing and implementing scalable, highly available system architectures that handle increasing loads and user demands without compromising performance.
Create and optimize CI/CD pipelines to automate testing and deployment, reducing development-to-production time and ensuring consistent quality control.
Design, monitor, and respond to system alerts; monitor system performance, identify bottlenecks, and implement optimizations and permanent fixes.
Manage incident response protocols, including on-call rotations.
Conduct post-incident reviews to prevent recurrence and refine the system reliability framework.
Provide primary operational support and engineering for multiple large-scale distributed software applications.
Collaborate with development operations staff to create, monitor, and troubleshoot system infrastructure.
Enhance system resilience and serve larger customer volumes through expert-level coding and robust release and change management.
Improve automation and increase the system’s self-healing capability.
Collect operating system data and report performance metrics to stakeholders.
Manage cloud and database system maintenance, debugging production issues as they arise.
Ensure the effective and seamless integration of security policies and practices into DevOps workflows to reduce overall risk and deliver products and services on time.
Collaborate with product development teams to understand upcoming products, enabling continuous integration and continuous deployment.
Implement end-to-end automated vulnerability assessment and penetration testing (VAPT) for new and existing applications.
Reduce planned deployment downtime by 50% through a robust CI/CD setup.
Achieve a Mean Time to Recovery (MTTR) of less than one hour for any SEV1 issue.
Achieve a Mean Time to Detect (MTTD) of less than five minutes with the help of automated tools and methods.

Key Competencies/Skills Required and Experience:
This is a technical manager position that requires the ability to perform independent proofs of concept (POCs) and collaborate with cross-functional departments. The candidate should possess the following technical skills and experience:
Bachelor's degree (B.E./B.Tech. preferred).
At least 14 years of extensive experience in DevSecOps and Site Reliability Engineering (SRE), including leading teams.
Demonstrated experience in managing large-scale distributed systems, with an understanding of scalability and reliability principles.
Strong hands-on experience in Service Management and Change Management using tools such as ServiceNow.
Experience in managing enterprise activities such as disaster recovery (DR) tests and other infrastructure activities.
Experience owning DevOps DORA metrics and SRE toil reduction.
Proficiency in security tooling for Static Application Security Testing (SAST), Dynamic Application Security Testing (DAST), and container security.
Expertise in Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.
Experience with container technologies such as Docker, Kubernetes, and OpenShift.
Knowledge of DevSecOps tools including Git, Maven, Selenium, Jenkins, Ansible, and various security tools.
Understanding of monitoring tools such as Nagios, Dynatrace, and SolarWinds.
Scripting knowledge in Shell, Python (preferred), Groovy, and YAML.
Experience with and understanding of at least one cloud provider, such as AWS, Azure, or Oracle Cloud Infrastructure (OCI).
Ability to provision on-demand infrastructure, including environment spin-offs and cloning, using EKS and IaC.
Hands-on experience configuring SLAs, SLOs, SLIs, and infrastructure and business rules/logic in application performance management (APM) tools such as Dynatrace, AWS CloudWatch, and DataDog.
Understanding of network protocols, load balancing, and firewall management for secure and efficient network operations.
Job Title
Vice President - Site Reliability Engineering