Vacancy title:
Senior Site Reliability Engineer
Jobs at:
Twiga Foods
Deadline of this Job:
26 October 2022
Summary
Date Posted: Wednesday, October 12, 2022, Base Salary: Not Disclosed
JOB DETAILS:
• The role holder will be responsible for leading the end-to-end design, development and deployment of engineering solutions to run scalable, distributed and fault-tolerant software systems for Twiga Foods. The role holder will lead the implementation of automated solutions to ensure uptime, reliability and improvement of Twiga Foods’ systems in line with set service level objectives. He/she will be required to provide leadership in determining software engineering needs from product/engineering requirements and collaborating across the organisation to clarify requirements and expected outcomes.
• They are also accountable for work assigned, ensuring that it is broken down into a plan with estimates, priorities and deliverables; ensuring adherence to the plan; and communicating when any adjustments to scope are needed to meet deadlines.
• Additionally, he/she will contribute to the wellbeing of the Twiga technology ecosystem by tracking production systems’ capacity and performance, fixing issues and taking on-call responsibilities.
Key Responsibilities
Site Reliability
• Collaborate with other cross-functional teams to design, develop, and deliver required software
• Develop, manage and support SRE tools and applications.
• Lead/own and drive the development/implementation of SRE tools within the Product/Technical Requirements Document.
• Develop or review technical specification documents within the SRE team and wider engineering team.
• Lead the deployment, training, and rollout of major/minor SRE tools across various engineering/tech teams.
• Deliver feature work consistently and on time whilst still tackling tech debt. Ensure that code meets agreed accuracy, testability, efficiency, and style guidelines. Deliver software systems that meet agreed SLOs for performance and reliability.
• Produce a work breakdown structure with estimates, deadlines, and deliverables. Own features from technical specification through implementation to deployment into production.
• Engage in improving the software development lifecycle, providing feedback on requirements, architecture,
• Build resilience into systems so underlying failures are handled gracefully and do not impact end users.
• Develop automated predictive analysis of future capacity needs, and proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage.
• Manage individual priorities, deadlines, and deliverables.
• Defend and challenge technical decisions made through solution design and code review feedback
• Finalise and own technical documentation for the developed features
On-Call Technical Support
• Monitor application availability and performance, take steps to improve overall application performance and stability, and follow through with implementation
• Participate in
• Triage system issues and debug, track, and resolve them by analysing their sources and offering corrective measures, through end-to-end incident response and management.
• Drive efficiencies through systems improvement and root cause analysis resulting in service delivery,
• Analyze logs and telemetry data by writing monitoring and automation code.
• Identify and automate repetitive, manual, and non-tactical work that impacts software development and deployment.
Innovation
• Investigate site reliability technologies and their applicability to the Twiga ecosystem. Identify significant projects that result in substantial improvements in reliability, cost savings and/or revenue.
• Provide reports on findings, with recommendations and a viable plan of action.
• Lead design reviews with peers and stakeholders to decide amongst available technologies
• Evaluate and review existing systems, SRE processes, and tools.
• Develop and lead implementation of a viable technical specification document in collaboration with members of SRE or engineering team.
• Contribute to the definition of SLOs for services/applications.
In-Team Collaboration
• Work with peers to build a stronger engineering team
• Lead process improvements that boost productivity and quality of Twiga engineering
• Regularly contribute improvements to existing documentation and codebase as per agreed standards.
• Review code developed by others and provide feedback to ensure adherence to Twiga Engineering best practices.
• Contribute regular knowledge shares through a variety of mediums including lunch and learn sessions.
• Provide mentorship for SRE engineers and interns in the section.
• Mentor/Coach/Train engineers on system design,
• Develop and maintain relationships with various engineering teams and their members.
• Acquire and maintain an understanding of multiple engineering teams’ processes and tools.
• Influence the engineering roadmap and work with engineering and/or product counterparts to improve the resiliency and reliability of Twiga systems.
• Build deep domain knowledge and share that knowledge through recorded demos, technical presentations, discussions, and incident reviews.
Self-management
• Model Twiga’s culture and way of working.
• Deliver the performance objectives set for the team. Hold monthly 1-on-1 performance reviews with line manager, and institute corrective action where performance falls below expectation.
• Proactively manage own learning and development
• Adhere to the annual leave plan agreed with the line manager
• Adhere to people management policies
Compliance
• Comply with all organization policies, procedures, and statutory guidelines. Minimize and mitigate risks to the organization and enforce zero-tolerance to non-compliance.
• Close gaps/lapses identified as an outcome of audits; risk and/or any other compliance review; investigations; or other assessment mechanisms and take corrective/preventive actions within the agreed timelines.
Minimum Qualifications & Requirements
• Degree in Engineering, Computer Science, Information Technology or a related discipline, or demonstrated equivalent skill/competence.
• Minimum of 5 years of relevant experience
• Observability and monitoring of infrastructure, applications, services, and networks
• Troubleshooting issues across the entire stack (hardware, software, network etc.)
• Writing infrastructure as code and automation scripts
• Building and maintaining CI/CD pipelines
• Building, running, and optimising containers with Docker or containerd
• Setting up and operating highly available and reliable infrastructure
• At least 3 years’ experience working with relational databases (Postgres, MySQL or Microsoft SQL Server), non-relational databases, and in-memory data stores
• At least 2 years' experience creating/managing SLIs/SLOs/Error Budgets.
• Strong technical understanding of Android, front-end and backend development
• Experience in designing, implementing and securing distributed systems
• Strong experience with analysing logs, metrics and traces.
• Creating system reports and system alerts.
• The use, maintenance and configuration of monitoring, observability and telemetry metrics and logging infrastructure (Prometheus, Grafana, ELK, or Sentry)
• Understanding of Agile/Scrum development principles
• Understanding of ITIL incident and problem management practices
• Can work accurately and quickly, to ensure key project milestones are achieved within set timelines, even when working under pressure.
• Always have a positive attitude and approach to the role and team.
Work Hours: 8
Experience in Months: 60
Level of Education: Bachelor Degree
Job application procedure
• Interested and qualified? Click here to apply