About:
Step forward into the future of technology with ZILO™.
We’re here to redefine what’s possible in technology. While we’re trusted by the global Transfer Agency sector, our technology is truly flexible and designed to transform any business at scale. We’ve created a unified platform that adapts to diverse needs, offering the scalability and reliability legacy systems simply can’t match.
At ZILO™, our DNA is built on Character, Creativity, and Craftsmanship. We face every challenge with integrity, explore new ideas with a curious mind, and set a high standard in every detail.
We are a team of dedicated professionals where everyone, regardless of their role, drives our progress and creates real impact. If you’re ready to shape the future, let’s talk.
Requirements
About the Role
We’re looking for a Senior Site Reliability Engineer to join our SRE team. This is a hybrid role that blends deep platform engineering with application-level troubleshooting. You’ll be responsible for the stability, performance, and resilience of our cloud-native infrastructure while also being on the front line when issues affect our users and services.
This is a high-impact role ideal for someone who thrives in a modern DevOps culture, cares about both system uptime and customer experience, and is comfortable working across infrastructure and application layers.
Key Responsibilities
🛠️ Infrastructure Reliability & Operations
- Own patching, upgrades, and maintenance of AWS and EKS infrastructure
- Define and implement resilience and failover strategies for microservices and core platforms
- Continuously monitor and improve system performance, cost-efficiency, and observability (LGTM stack / Datadog)
- Partner with security teams on compliance and vulnerability remediation
⚙️ Chaos Engineering & Resilience
- Design and execute Chaos Engineering experiments
- Develop and track SLOs, SLIs, and error budgets for critical systems
- Conduct resilience reviews and game days to validate system behavior under failure
💬 Kafka & Eventing
- Ensure Kafka clusters are optimally configured for performance and durability
- Support producers/consumers and troubleshoot event delivery and retention issues
- Monitor and tune partitioning, replication, throughput, and latency
🧩 Application-Level Incident Support
- Respond to production incidents — from user-facing UI errors to backend service disruptions
- Investigate issues across infrastructure, Kubernetes, logs, traces, and service code
- Resolve incidents and support root cause analysis in Java and GoLang services
- Contribute to postmortems and reliability engineering initiatives
Who You Are
✅ Essential Experience
- 5+ years in an SRE, DevOps, or infrastructure role
- Deep hands-on experience with AWS, EKS/Kubernetes, and Terraform
- Working knowledge of Kafka tuning, monitoring, and operational troubleshooting
- Enough familiarity with one or more of the following application languages and frameworks to read code and trace failures:
  - Java
  - GoLang
  - React
  - .NET
  - Python
- Solid understanding of modern observability tooling (e.g., Datadog, Loki, Grafana)
- Comfortable working on a shared on-call rotation
Benefits
- Enhanced leave - 38 days inclusive of 8 UK Public Holidays
- Private Health Care including family cover
- Life Assurance – 5x salary
- Flexible working: work from home and/or in our London office
- Employee Assistance Program
- Company Pension (Salary Sacrifice options available)
- Access to training and development
- Buy and Sell holiday scheme
- The opportunity for “work from anywhere/global mobility”