Who is Heidi?

Heidi is on a mission to halve the time it takes to deliver world-class care.

We believe that in 2050 every clinician will practise with AI systems that free them from administrative burden and increase the quality and accessibility of care for patients across the world.

Today, we have a suite of tools that modernise documentation. Tomorrow, we'll equip every healthcare organisation with AI assistants that undo the tedium of clinical and non-clinical work.

Our team is a potent mosaic of sage, accomplished leaders and brilliant polymaths hungry to prove it. We achieve in 6 months what takes our competitors 4 years.

We've raised our $10M Series A, led by Australia's largest VC firm, Blackbird Ventures, and are in the midst of our next raise, with an ambitious global go-to-market strategy starting with the US and UK.

The Role

As a Senior Site Reliability Engineer at Heidi, you'll be instrumental in establishing and scaling our reliability practices while ensuring robust, secure and observable systems. You'll work closely with our engineering team to implement comprehensive monitoring, incident management and reliability processes for our AI-powered healthcare solutions.

Primary Responsibilities:

Observability & Monitoring
- Design and implement comprehensive observability strategies using Datadog, or other tooling you can convince us of!
- Implement OpenTelemetry instrumentation across our backend and frontend services
- Set up real user monitoring (RUM) and application performance monitoring (APM) to ensure end-to-end visibility
- Create and maintain dashboards that provide meaningful insights for different stakeholders (technical teams, support, management)
- Monitor and optimise third-party service integrations, particularly for critical services

Incident Management & Response
- Establish and implement incident management processes from the ground up
- Evaluate and implement appropriate incident management tools that integrate with our observability stack
- Create and maintain incident response playbooks and automated runbooks
- Lead post-incident reviews and foster a blameless culture
- Implement and maintain on-call rotations and escalation policies

SLA & SLO Management
- Define and implement SLOs that align with business requirements and customer expectations
- Set up error budgets and tracking mechanisms
- Create comprehensive SLA reporting for enterprise customers
- Design and implement SLI metrics that provide meaningful insights into service health

Cost Optimisation & Efficiency
- Optimise observability costs through efficient logging and metrics collection
- Implement log management and retention strategies
- Fine-tune alerting to minimise alert fatigue while maintaining service reliability
- Evaluate and recommend cost-effective tooling solutions

Key Requirements:
- Extensive experience with observability platforms (Datadog preferred) and an understanding of observability architecture
- Strong knowledge of OpenTelemetry and modern instrumentation practices
- Experience implementing APM and RUM in Python and React/React Native environments
- A track record of establishing incident management processes and fostering a blameless culture
- Experience defining and implementing SLAs/SLOs for enterprise customers
- A strong background in monitoring distributed systems and third-party service integrations
- Experience with cloud infrastructure (AWS required; Azure and GCP beneficial)
- A proven track record of implementing SRE practices and reliability improvements

Preferred Qualifications:
- Experience with chaos engineering practices
- Knowledge of automated runbook implementation
- Healthcare industry experience
- Understanding of HIPAA or similar healthcare compliance frameworks

What we will look for:
- A problem-solving mindset with a focus on reliability and scalability
- Strong communication skills to work with cross-functional teams
- The ability to balance technical requirements with business needs
- Experience in fast-paced startup environments
- Dedication to maintaining high standards in a regulated environment

What do we believe in?

We create unconventional solutions to difficult problems, and we build them fast. We want you to set impossible goals and make them happen: think landing a rocket, but the medical version.

You'll be surrounded by a world-class team of engineers, medicos and designers to do your best work, inspired by our shared beliefs:
- We will stop at nothing to improve patient care across the world.
- We design user experiences for joy and ship them fast.
- We make decisions in a flat hierarchy that prioritises the truth over rank.
- We provide the resources for people to succeed and give them the freedom to do it.

Why you will flourish with us:
- A flexible hybrid working environment, with 3 days in the office
- An additional paid day off for your birthday, plus wellness days
- Special corporate rates at Anytime Fitness in Melbourne (Sydney TBC)
- A generous personal development budget of $500 per annum
- Learn from some of the best engineers and creatives, joining a diverse team
- Become an owner, with shares (equity) in the company: if Heidi wins, we all win
- The rare chance to create a global impact as you immerse yourself in one of Australia's leading healthtech startups
- If you make an impact quickly, the opportunity to fast-track your startup career!
Job Title
Senior SRE