Site Reliability Engineer - Manager
G-Research is Europe’s leading quantitative finance research firm. We hire the brightest minds in the world to tackle some of the biggest questions in finance. We pair this expertise with machine learning, big data, and some of the most advanced technology available to predict movements in the financial markets.
The Site Reliability Engineering (SRE) team is a knowledge centre for best practices around reliability, operational maturity and highly performant services. We are a centralised, cross-functional team that provides consultancy and advice to Engineering and the company as a whole. We are not a textbook SRE function but closer to a Customer Reliability Engineering (CRE) team.
We are looking for an individual to lead this team through advocacy, education and outreach. Come join a team of people who are as driven by improving themselves and others as you are.
As the manager of the SRE team, you will be responsible for:
- Hiring, mentoring, leading, retaining and motivating a growing, independent SRE team
- Acting as a guide for technical operations and development teams, using data to inform priorities and to balance reliability, security and features appropriately
- Transforming culture through high impact innovation and education
- Driving internal processes for incident management and blameless post-mortem culture adoption, as well as operational maturity
- Building and maintaining an observability stack so that all teams can easily monitor their services, including owning and administering PagerDuty
- Engaging with our engineering teams on support issues and improvements to our tooling, processes, and infrastructure
Who are we looking for?
The ideal candidate will have the following traits, knowledge, and experience:
- The ability to advocate the benefits of SRE culture, such as the three pillars of observability (metrics, logs and traces), incident management, SLIs and SLOs, and more
- The ability to educate members of the organisation on SRE best practices
- In-depth knowledge and experience in at least one of the following: host-based networking, systems administration, systems programming, distributed systems, databases, cloud computing, or Kubernetes
- A proactive approach to spotting problems, areas for improvement, performance bottlenecks, etc.
- The ability to understand the inherent trade-offs between various software architectures in relation to performance, resiliency/fault tolerance, load balancing, and data consistency
The following additional skills would be beneficial, but are not required:
- Experience with a full observability stack with technologies that include Prometheus, Jaeger, Grafana, Elastic
- Distributed systems knowledge such as Kubernetes, Docker Swarm, Borg, Tupperware
- Networking experience, such as analysing packet dumps, packet filtering and load balancers
- OS/kernel experience, such as familiarity with OS tunables and log analysis
- Experience with automated configuration management and infrastructure-as-code tools like Ansible, Terraform, CloudFormation
- Experience with distributed storage technologies like NFS, HDFS, Ceph, S3
- Experience with authentication and encryption technologies like SSL, Kerberos, OAuth
Why should you apply?
- Highly competitive compensation plus annual discretionary bonus
- Informal dress code and excellent work/life balance
- Comprehensive healthcare and life assurance
- 25 days holiday
- 9% company pension contributions
- Cycle-to-work scheme
- Subsidised gym membership
- Monthly company events
- Central London office close to 5 stations and 6 tube lines
G-Research is committed to cultivating and preserving an inclusive work environment. We are an ideas-driven business and we place great value on diversity of experience and opinions.
We want to ensure that applicants receive a recruitment experience that enables them to perform at their best. If you have a disability or special need that requires accommodation, please let us know in the relevant section.