Description & Requirements
Are you passionate about building high-performance systems that are fast, resilient, and operate at global scale? Join Bloomberg’s Application Middleware SRE team, where you’ll combine software engineering and systems expertise to keep the backbone of the Bloomberg Terminal running smoothly for hundreds of thousands of users around the world.
We’re not your typical SRE team. We’re embedded in a group that powers real-time connectivity, and we own systems where uptime isn’t just important; it’s essential to the global financial system. This is your opportunity to engineer resilience at scale, automate critical infrastructure, and shape reliability practices across one of the world’s most powerful tech platforms.
The Team
We’re the Site Reliability Engineering team within Bloomberg’s Application Middleware group. Our mission: ensure that Bloomberg’s core connectivity and messaging layers are resilient, scalable, and fully observable.
We own systems that operate at high throughput and low latency, including:
- Gateways: Secure, high-performance TCP/SSL entry points to our data centers
- HFN & NSTP: A global HTTP CDN and SOCKS5 proxy network delivering fast access from any geography
- Playlist Services: Dynamic path configuration systems optimizing user connectivity in real time
- PGM Relays: Infrastructure for reliable multicast data delivery
What You’ll Do
- Build production-grade software that powers Bloomberg’s global infrastructure
- Design and implement scalable, fault-tolerant systems with a focus on observability, performance, and automation
- Collaborate across engineering teams to introduce automated, self-service operational workflows
- Conduct deep systems analysis and root cause investigations for complex, distributed systems
- Propose and prototype innovative approaches to reliability and risk mitigation
- Contribute to design docs, runbooks, and post-incident reviews—clear communication is part of the job
What You’ll Need
- A degree in Computer Science, Engineering, Mathematics, or equivalent practical experience
- Strong software engineering skills in any high-level language (we mainly use Python and C++)
- A deep understanding of software system reliability and risk management, including how to identify potential points of failure and design mitigation strategies
- A good understanding of data structures, algorithms, and system design
- Experience navigating and improving large, distributed codebases
- An ability to identify system risks and engineer around points of failure
- Clear written and verbal communication, including technical documentation and incident analysis
We are building a team with a breadth of expertise and value depth in any of the following areas:
- Systems Knowledge: A strong grasp of operating systems, fundamental networking protocols (TCP, UDP, multicast), or core database concepts as they apply to modern infrastructure.
- Cluster Management: Experience with deployments, staging, and configuration management. Direct experience with Argo, Kubernetes, or other pipeline management platforms is a significant advantage.
- Machine Management at Scale: Experience with capacity planning and automating the lifecycle of large machine fleets.
- System Observability and Monitoring: Deep understanding of SLIs/SLOs/SLAs, alerting, and building dashboards for complex systems.
- Reliability in Distributed Systems: Knowledge of fault tolerance and the unique challenges of network and node failure in distributed environments.
- Mentoring: Proven experience mentoring and growing junior engineers.