SRE Engineer job at Equity Bank

Background information about the job or company (e.g., role context, company overview)

Equity Bank Limited (the “Bank”) is incorporated and registered under the Kenyan Companies Act Cap 486 and domiciled in Kenya. The address of the Bank’s registered office is 9th Floor, Equity Centre, P.O. Box 75104 - 00200 Nairobi. The Bank is licensed under the Kenya Banking Act (Chapter 488), and continues to offer retail banking, microfinance and relat...

The Site Reliability Engineer (SRE) is responsible for improving the reliability, scalability, availability, and performance of enterprise systems through automation, infrastructure-as-code, and engineering‑driven operational practices. The role focuses on reducing operational toil, enabling efficient CI/CD deployments, optimizing system capacity and performance, and working closely with development teams to design resilient, self‑healing systems supported by strong monitoring, documentation, and operational standards.

Responsibilities or duties

  • Install, configure, and maintain ELK stack components (Elasticsearch, Logstash, Kibana, Beats) across environments.
  • Design efficient dashboards, graphs, and visualizations that translate application logs into business‑readable insights.
  • Analyze application logs to identify trends, risks, and incidents affecting system performance and availability.
  • Develop customized reports, bar charts, and pie charts to support operational and business decision‑making.
  • Implement ELK‑triggered auto‑healing and remediation scripts to detect and resolve incidents proactively.
  • Identify repetitive, manual, and reactive operational tasks and eliminate them through automation.
  • Develop scripts and tools using languages such as Python, Bash, or Go to automate system maintenance and operational workflows.
  • Implement Infrastructure as Code (IaC) using tools such as Terraform or Ansible to ensure consistent, repeatable infrastructure provisioning.
  • Design and implement self‑healing systems capable of automatic recovery from common failures without human intervention.
  • Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) in collaboration with business and development teams.
  • Build and maintain robust monitoring, logging, and observability solutions using tools such as ELK, Prometheus, Grafana, or equivalent platforms.
  • Configure intelligent, actionable alerts that minimize noise and false positives while ensuring rapid incident detection.
  • Continuously improve monitoring coverage and system visibility to support proactive operations.
  • Participate in on‑call rotations to respond to critical system alerts and production incidents.
  • Diagnose, mitigate, and resolve incidents to restore services within agreed SLAs.
  • Conduct blameless post‑incident reviews to identify root causes and define preventative actions.
  • Develop and maintain runbooks and playbooks for common incident scenarios to improve response time and consistency.
  • Analyze historical system usage and trends to forecast future capacity requirements.
  • Perform system and database performance tuning in collaboration with development teams.
  • Conduct load and stress testing to identify bottlenecks before they impact production systems.
  • Ensure systems are cost‑efficient, scalable, and capable of supporting business growth.
  • Work closely with software development teams during solution design to ensure reliability, scalability, and operational readiness.
  • Promote a DevOps and SRE culture through shared ownership of system reliability (“You Build It, You Run It”).
  • Share knowledge, best practices, and documentation to uplift operational maturity across teams.

Qualifications or requirements (e.g., education, skills)

  • Elasticsearch, Logstash, Kibana (ELK Stack)
  • Microsoft Azure
  • Unix / Linux and Shell Scripting
  • SQL and database concepts
  • Monitoring and observability tools
  • Strong analytical, problem‑solving, and documentation skills
  • Bachelor’s degree in Science, Engineering, Information Technology, or a related field
  • Nice to have: ELK, Azure, or other relevant cloud/observability certifications

Experience needed

  • Minimum 2 years’ experience in a Site Reliability Engineering, DevOps, or Production Support role
  • Mandatory hands‑on experience with ELK Stack
  • Experience supporting banking or enterprise‑scale applications

Vacancy title:
SRE Engineer

[Type: FULL_TIME, Industry: Banking, Category: Computer & IT, Science & Engineering]

Jobs at:
Equity Bank

Deadline of this Job:
Wednesday, April 15 2026

Duty Station:
Nairobi | Nairobi

Summary
Date Posted: Wednesday, April 1 2026, Base Salary: Not Disclosed


Work Hours: 8

Experience in Months: 12

Level of Education: bachelor degree

Job application procedure

Never pay for any notarisation, certificate or assessment as part of any recruitment process. When in doubt, contact us

Click Here to Apply Now

Job Info
Job Category: Engineering jobs in Kenya
Job Type: Full-time
Deadline of this Job: Wednesday, April 15 2026
Duty Station: Nairobi | Nairobi
Posted: 01-04-2026
No of Jobs: 1