Site Reliability Engineer
G-Research is Europe’s leading quantitative finance research firm. We hire the brightest minds in the world to tackle some of the biggest questions in finance. We pair this expertise with machine learning, big data, and some of the most advanced technology available to predict movements in the financial markets.
The Site Reliability Engineering (SRE) team is a knowledge centre for best practices around reliability, operational maturity and highly performant services. We are a central, cross-functional team that provides consultancy and experienced advice throughout the company (not just within Engineering).
We are looking for Site Reliability Engineers, operationally-minded software developers, or infrastructure specialists who understand that "operational excellence" is more than just the ability to hack on Jenkinsfiles. Do you want to get involved in solving hard operational problems and making difficult trade-offs at an organisational level, while building the real tools and abstractions that help others do the same? Come join a team who are as driven by improving themselves and others as you are.
As a member of the SRE team, you will be responsible for:
- Acting as a guide for technical operations and development teams, using data to inform decisions and priorities
- Consulting with teams to improve all aspects of software reliability, release management and testing
- Designing and delivering internal educational content
- Driving internal processes for incident management, operational maturity and the adoption of a blameless post-mortem culture
- Building and maintaining an observability stack such that all teams can easily monitor their services
- Engaging with our software engineering teams on support issues and improvements to our tooling, processes, and infrastructure
- Transforming culture through high-impact innovation and education
Who are we looking for?
The ideal candidate will have:
- In-depth knowledge and experience in at least one of: host-based networking, systems administration, systems programming, distributed systems, databases, cloud computing, or Kubernetes (and a desire to learn more)
- The ability to educate members of the organisation on SRE best practices
- The ability to leverage off-the-shelf/open source systems and utilities to provision production systems rapidly in a variety of domains, especially for multi-tenant use
- A proven track record of automation and an algorithmic approach to solving problems
- A proactive approach to spotting problems, areas for improvement, performance bottlenecks, etc.
- The ability to understand the inherent trade-offs between various software architectures as they relate to performance, resiliency/fault tolerance, load balancing and data consistency
- The ability to profile and debug applications in real time
The following additional skills are not required, but would be beneficial:
- Experience with authentication and encryption technologies like SSL, Kerberos, OAuth
- Knowledge of distributed systems such as Kubernetes, Docker Swarm, Borg, Tupperware
- Networking experience, such as analysing packet dumps, packet filtering, load balancers
- OS/kernel experience, such as familiarity with OS tunables, log analysis
- Experience with automated configuration management and provisioning tools like Ansible, Terraform, CloudFormation
- Experience with distributed storage technologies like NFS, HDFS, Ceph, S3
- Experience with a full observability stack with technologies that include Prometheus, Jaeger, Grafana, Elastic
Why should you apply?
- Highly competitive compensation plus annual discretionary bonus
- Informal dress code and excellent work/life balance
- Comprehensive healthcare and life assurance
- 25 days holiday
- 9% company pension contributions
- Cycle-to-work scheme
- Subsidised gym membership
- Monthly company events
- Central London office close to 5 stations and 6 tube lines