About:

Step forward into the future of technology with ZILO™.

We’re here to redefine what’s possible in technology. While we’re trusted by the global Transfer Agency sector, our technology is truly flexible and designed to transform any business at scale. We’ve created a unified platform that adapts to diverse needs, offering the scalability and reliability legacy systems simply can’t match.

At ZILO™, our DNA is built on Character, Creativity, and Craftsmanship. We face every challenge with integrity, explore new ideas with a curious mind, and set a high standard in every detail.

We are a team of dedicated professionals where everyone, regardless of their role, drives our progress and creates real impact. If you’re ready to shape the future, let’s talk.

Requirements

About the Role

We’re looking for a Senior Site Reliability Engineer to join our SRE team. This is a hybrid role that blends deep platform engineering with application-level troubleshooting. You’ll be responsible for the stability, performance, and resilience of our cloud-native infrastructure while also being on the front line when issues affect our users and services.

This is a high-impact role ideal for someone who thrives in a modern DevOps culture, cares about both systems uptime and customer experience, and is comfortable working across infrastructure and application layers.

Key Responsibilities

🛠️ Infrastructure Reliability & Operations

Own patching, upgrades, and maintenance of AWS and EKS infrastructure
Define and implement resilience and failover strategies for microservices and core platforms
Continuously monitor and improve system performance, cost-efficiency, and observability (LGTM stack / Datadog)
Partner with security teams on compliance and vulnerability remediation

⚙️ Chaos Engineering & Resilience

Design and execute Chaos Engineering experiments
Develop and track SLOs, SLIs, and error budgets for critical systems
Conduct resilience reviews and game days to validate system behavior under failure

💬 Kafka & Eventing

Ensure Kafka clusters are optimally configured for performance and durability
Support producers/consumers and troubleshoot event delivery and retention issues
Monitor and tune partitioning, replication, throughput, and latency

🧩 Application-Level Incident Support

Respond to production incidents — from user-facing UI errors to backend service disruptions
Investigate issues across infrastructure, Kubernetes, logs, traces, and service code
Resolve incidents and support root-cause analysis (Java and GoLang services)
Contribute to postmortems and reliability engineering initiatives

Who You Are

✅ Essential Experience

5+ years in an SRE, DevOps, or infrastructure role
Deep hands-on experience with AWS, EKS/Kubernetes, and Terraform
Working knowledge of Kafka tuning, monitoring, and operational troubleshooting
Ability to read code and trace failures in one or more of the following application languages and frameworks:

Java
GoLang
React
.NET
Python

Solid understanding of modern observability tooling (e.g., Datadog, Loki, Grafana)
Comfortable working on a shared on-call rotation

Benefits

Enhanced leave: 38 days, inclusive of 8 UK public holidays
Private Health Care including family cover 
Life Assurance – 5x salary 
Flexible working: work from home and/or in our London office
Employee Assistance Program 
Company Pension (Salary Sacrifice options available)
Access to training and development 
Buy and Sell holiday scheme
The opportunity for “work from anywhere/global mobility”
