MLOps Support Team Lead job at CloudFactory

Vacancy title:
MLOps Support Team Lead

[Type: Full-time, Industry: Manufacturing, Category: Management, Computer & IT, Science & Engineering, Business Operations, Team Leader]

Jobs at:
CloudFactory

Deadline of this Job:
Monday, May 25 2026

Duty Station:
Nairobi | Nairobi

Summary
Date Posted: Monday, May 18 2026, Base Salary: Not Disclosed

JOB DETAILS:

Role Summary

As the MLOps Operations Lead, you will own the day-to-day reliability, supportability, and operational maturity of CloudFactory's MLOps service. You will lead a global support team responsible for monitoring, triaging, and resolving issues across production ML systems, while driving improvements in observability, incident management, and service delivery.

You will work closely with Engineering, Platform Ops, and external partners to ensure AI/ML solutions are not only functional, but stable, measurable, and trusted in production. This role is critical in transitioning MLOps from reactive support to a proactive, scalable service capability.

Responsibilities

Service Ownership & Reliability

  • Own the operational performance of all production ML systems and pipelines
  • Ensure reliability, availability, and supportability across client and internal MLOps workloads
  • Establish and enforce SLAs, SLOs, and operational standards
  • Act as the escalation point for major incidents and service degradation

Team Leadership & Delivery

  • Lead a global MLOps Support team (L1/L2) across regions (Colombia, Kenya, Nepal)
  • Define shift patterns, on-call rotations, and coverage models
  • Set clear expectations, performance metrics, and development plans
  • Foster a strong operational culture focused on accountability and continuous improvement

Incident Management & RCA

  • Own incident response processes, including triage, communication, and resolution
  • Ensure high-quality Root Cause Analysis (RCA) and follow-through on corrective actions
  • Drive reduction in repeat incidents through structured problem management
  • Improve time to detect (TTD) and time to resolve (TTR) metrics

Monitoring, Observability & MLOps Maturity

  • Drive implementation and evolution of monitoring across:
    • pipelines and data flows
    • infrastructure and compute
    • model performance and drift
  • Ensure visibility extends beyond system health to model accuracy, bias, and data integrity
  • Partner with Engineering to improve instrumentation, logging, and alerting

Support Model & Process Design

  • Define and evolve the MLOps support operating model
  • Clearly establish boundaries between Support, Engineering, and external partners
  • Build and maintain runbooks, playbooks, and escalation paths
  • Standardize intake, triage, and resolution workflows (e.g. Slack, ticketing systems)

Stakeholder & Partner Management

  • Act as the primary operational interface for:
    • Engineering teams
    • Platform Operations
    • External partners
  • Reduce reliance on individuals by formalizing ownership and knowledge sharing
  • Provide clear communication during incidents and service updates

Continuous Improvement & Scaling

  • Identify trends in incidents and operational inefficiencies
  • Drive improvements in:
    • automation
    • alert quality
    • self-healing capabilities
  • Support onboarding of new MLOps projects into a standardized support model
  • Contribute to building MLOps as a scalable, repeatable service offering

Reporting & Service Health

  • Define and track key operational metrics:
    • incident volume and severity
    • SLA adherence
    • system uptime and reliability
  • Support regular service reviews and model health reporting
  • Provide leadership visibility into risks, trends, and improvement areas

Requirements

Must-Have Skills (Required)

  • Proven experience in operations leadership, SRE, DevOps, or platform support environments
  • Strong understanding of production support models, incident management, and escalation frameworks
  • Experience leading or mentoring technical support or operations teams
  • Working knowledge of ML systems in production, including:
    • pipelines and batch processing
    • model lifecycle and deployment
    • common failure modes
  • Strong analytical and troubleshooting skills in complex environments
  • Experience with monitoring and observability tools
  • Proficiency in:
    • SQL
    • Python or scripting (Bash)
  • Ability to operate in a high-pressure, incident-driven environment while maintaining structure and clarity
  • Strong stakeholder management and communication skills

Nice-to-Have Skills (Preferred)

  • Experience supporting AI/ML platforms at scale
  • Familiarity with tools such as:
    • Databricks
    • MLflow
    • Grafana
    • Power BI
    • New Relic
  • Exposure to model monitoring (drift, bias, performance validation)
  • Experience working with external partners or vendors in delivery models
  • Understanding of cloud platforms (AWS, GCP, Azure)
  • Experience with containerized environments (Docker / Kubernetes)
  • Background in building or scaling support functions from early-stage to maturity

General Requirements

  • Strong service ownership mindset — takes accountability for outcomes, not just activity
  • Calm, structured, and decisive during incidents
  • Ability to balance operational delivery with strategic improvement
  • Passion for building reliable, trustworthy AI/ML systems
  • Highly collaborative across Engineering, Platform, and Delivery teams
  • Focus on reducing risk related to:
    • model performance
    • bias
    • data integrity
  • Commitment to documentation, knowledge sharing, and eliminating single points of failure

Work Hours: 8

Experience in Months: 12

Level of Education: Bachelor's degree

Job application procedure

Application Link:

Click Here to Apply Now
