Senior Site Reliability Engineer job at Deimos
Job Title: Senior Site Reliability Engineer
Company: Deimos
Job Type: Full-time
Location: Nairobi, 00100, Kenya
Duty Station: Remote
Industry: Information Technology
Category: Computer & IT, Science & Engineering
Posted: 2026-05-20T09:13:41+00:00
Deadline: 2026-05-27T17:00:00+00:00
Salary Currency: KES
Salary Period: Month
Work Hours: 8

Background

Businesses today are adopting the cloud for improved services to their customers. Our purpose is to guide companies on that journey to drive the adoption of DevSecOps so that our clients can remain ahead of the curve. We have an intense focus on engineering fundamentals, whether Developer and Security Operations, Cloud Native Transformation Strategy or So...

Role Overview

We are looking for an experienced Senior Site Reliability Engineer to join our Professional Services team and deliver Software and DevSecOps projects. You will report to a Site Reliability Engineering Manager.

SRE / DevOps is one of our core competencies. You will be part of a highly skilled team that continuously innovates and delivers high-value solutions to clients across various industries on all public clouds (AWS, Azure, GCP, etc.). Technologies we work with daily include Kubernetes, Helm, Terraform, and GitOps, to name a few.

Responsibilities

Enablement & RelOps Culture

  • Implement the Observability Ladder: Guide teams from basic monitoring to high-signal metric tracking. Work with product teams to define SLAs, SLIs, and SLOs, and build dashboards that track specific error budgets.
  • Empower Product Teams: Build frameworks and deployment tooling (e.g., CI/CD, internal tooling integrations) that allow teams to make data-driven decisions on deployment safety and automate rollbacks when error budgets are depleted.
  • Champion Reliability: Drive a blameless post-mortem culture focused on actionable takeaways, system improvements, and measurable metrics (MTBF, MTTR).
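
As a rough illustration of the error budget idea behind these duties, the sketch below derives the remaining budget from an SLO target and request counts, then uses it as a rollback gate. The `SloWindow` structure, the function names, and the sample numbers are assumptions made for this example only; they do not describe Deimos tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO window (hypothetical structure)."""
    slo_target: float       # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    if allowed_failures <= 0:
        # No traffic, or a 100% target: any failure exhausts the budget.
        return 1.0 if window.failed_requests == 0 else 0.0
    return 1.0 - window.failed_requests / allowed_failures


def should_roll_back(window: SloWindow) -> bool:
    """Gate a deployment: roll back once the budget is fully spent."""
    return error_budget_remaining(window) <= 0.0


if __name__ == "__main__":
    healthy = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"budget remaining: {error_budget_remaining(healthy):.2%}")
    print("roll back?", should_roll_back(healthy))
```

A dashboard would typically plot the remaining fraction over time, while a deployment pipeline could call a check like `should_roll_back` after each release.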

Frameworks & Automation

  • Standardised Alerting & On-Call: Continuously improve company-wide alerting and on-call frameworks to reduce alert fatigue, ensuring alerts are highly actionable and symptom-based.
  • Disaster Recovery: Drive evolution of DR strategies from manual processes into fully automated runbooks-as-code, allowing teams to prove and improve service recoverability through autonomous, evidence-based testing.
  • Eliminate Toil: Develop systems, automations, and tooling for pre- and post-deployment verification, ensuring our hands-off reliability vision becomes a production reality, via Python (or similar).
  • Reliability-as-Code: Lead the drive to manage our entire reliability suite through IaC. Use Terraform to architect, deploy, and configure our observability stack including ELK, Grafana, Loki, Prometheus, and Tracing.
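
One common way to keep alerts symptom-based and actionable is to page on error budget burn rate rather than on individual causes. The sketch below is a minimal version of that idea, assuming a two-window check; the function names and the default threshold of 14.4 (a value often quoted for a one-hour window against a 30-day SLO) are assumptions for illustration, not a description of any framework named in this posting.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both a long and a short window burn faster than the threshold.

    The long window shows the problem is significant; the short window shows
    it is still happening, which avoids paging for incidents already over.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # 2% of requests failing against a 99.9% SLO in both windows.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # The short window has recovered, so no page is sent.
    print(should_page(0.02, 0.0005, slo_target=0.999))
```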

Qualifications

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • Strong coding fluency: Proficiency in Python (or similar) with the ability to read, understand, reason about, and write production-grade automation code.
  • Cloud & IaC: Hands-on experience with AWS, and a solid understanding of Infrastructure as Code (Terraform or CloudFormation).
  • Deep Observability Knowledge: Demonstrable experience with monitoring tools (DataDog, Prometheus, ELK stack). Strong understanding of SRE concepts including Golden Signals, high-cardinality data handling, and error budget mathematics.
  • Systems Thinking: Strong grasp of designing for scale and resilience, including graceful failure, circuit breaking, connection pooling, and multi-AZ deployments.
  • Proven ability to define and drive reliability standards across multiple teams and drive a blameless post-mortem culture.
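
For readers less familiar with the resilience patterns listed above, here is a minimal circuit breaker sketch. The class name, thresholds, and injectable clock are assumptions chosen for this example; production systems would more likely rely on an established library or a service mesh feature.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Fail fast after repeated failures, then allow a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a dependency call as `breaker.call(lambda: fetch())`, where `fetch` is a hypothetical client function, lets callers degrade gracefully instead of queueing behind a dependency that is already failing.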

Qualities & Behaviours

  • Exceptional interpersonal and communication skills.
  • A zest for automation.
  • Comfortable working as a remote team member.
  • Ability to keep up to date with DevOps/SRE best practices, trends and innovation.
  • Passionate about mentoring and growing technical skills within the team.

Expected Output for the role

  • Automate Azure infrastructure provisioning and configuration using PowerShell, YAML and Bicep.
  • Monitor and troubleshoot issues in the Azure environment, including network, storage, and compute resources.
  • Deploy and manage Azure Databricks infrastructure for data processing and analytics.
  • Attend to support tickets, which may arise due to product components not functioning as expected.
  • Develop and maintain technical support documentation of the product.
  • Promote innovations to support business requirements through activities that test, pilot and implement innovative concepts.
  • Support and troubleshoot DevOps tools and processes for stakeholders.

About you

For us to achieve our ambitious vision together as a team, it is important for our Martians to lead at all levels and to be self-starters who take initiative and put their hands up for challenging tasks. A growth mindset is important to us, and we encourage all our Martians to openly share knowledge, support and help each other, ask questions, get creative with new technologies and learn from setbacks.

Becoming a Martian means:

  • Working comfortably with, and learning from, a fully remote, culturally diverse team based predominantly in South Africa, Kenya, Nigeria and Ghana.
  • Being an open, honest and respectful communicator.
  • Enjoying asking questions, identifying areas for improvement and proposing solutions, no matter your job title or whether you have been with us for a day, a month or years!
  • Being comfortable taking initiative and operating independently.
  • Thriving in a fast-paced environment where change is constant.
  • Finding it exciting to work with various clients from different industries, each with a different problem for you and your team to solve.
  • Intentionally sharing tech and industry trends that excite you with your peers.
  • Seeking continuous feedback and actively taking steps to grow personally and professionally.

What you get by joining us

  • Become a member of a team where we value each individual's contribution from day 1 and empower you to make suggestions, get involved and do what you love most!
  • Flexibility and the freedom to work remotely.
  • Work-life balance where you are not expected to work over weekends or after hours.
  • A forward-thinking remote company that knows how important it is to stay connected as one team and provides virtual social platforms for employee engagement.
  • A monthly work-from-home allowance, which you can use to set yourself up to work comfortably from home, whether that is pens, notebooks, new headphones or work snacks!
  • A MacBook or Windows laptop for you to do your best work on.
  • Become part of a team of exceptionally clever and talented people who like to share their knowledge and learnings.
  • We support your career growth and love to celebrate your successes and advancement!

Skills

  • Python
  • Kubernetes
  • Helm
  • Terraform
  • GitOps
  • AWS
  • Azure
  • GCP
  • CloudFormation
  • DataDog
  • Prometheus
  • ELK stack
  • PowerShell
  • YAML
  • Bicep
  • Azure Databricks
  • DevOps
  • DevSecOps
  • CI/CD
  • IaC
  • Observability
  • Monitoring
  • Alerting
  • On-Call
  • Disaster Recovery
  • Automation
  • Systems Thinking
  • Graceful Failure
  • Circuit Breaking
  • Connection Pooling
  • Multi-AZ Deployments

Experience

  • 5+ years of experience in Software Engineering, SRE, DevOps, or Platform Engineering, with demonstrable ownership of reliability standards at a team or company level.

Level of Education: bachelor degree
Experience in Months: 12
Job ID: JOB-6a0d7b45be3e9

Vacancy title:
Senior Site Reliability Engineer

[Type: FULL_TIME, Industry: Information Technology, Category: Computer & IT, Science & Engineering]

Jobs at:
Deimos

Deadline of this Job:
Wednesday, May 27 2026

Duty Station:
This Job is Remote

Summary
Date Posted: Wednesday, May 20 2026, Base Salary: Not Disclosed



Work Hours: 8

Experience in Months: 12

Level of Education: bachelor degree

Job application procedure

Application Link: Click Here to Apply Now

Job Info
Job Category: Engineering jobs in Kenya
Job Type: Full-time
Deadline of this Job: Wednesday, May 27 2026
Duty Station: This Job is Remote
Posted: 20-05-2026
No of Jobs: 1
Caution: Never Pay Money in a Recruitment Process.

Some smart scams can trick you into paying for Psychometric Tests.