104 DevOps Engineer jobs in London

Site Reliability Engineer

London, London £80000 - £90000 Annually Rise Technical Recruitment

Posted today

Job Description

Permanent

Senior Site Reliability Engineer
London - Hybrid
£80,000 - £90,000 + 38 Days Holiday + Private Healthcare + Life Assurance + Flexible Working + Pension


Excellent opportunity for a Site Reliability Engineer to join a forward-thinking, high-growth technology company offering a hybrid work environment, great benefits, and opportunities for further progression!

This company operates at the forefront of digital transformation, delivering a unified platform built for scalability, resilience, and performance. With a strong culture rooted in integrity, creativity, and technical excellence, they've become a trusted partner across global industries.

In this role you'll take ownership of platform reliability, resilience engineering, and incident management across cutting-edge cloud infrastructure. You'll play a key role in ensuring uptime, performance, and continuous improvement of core systems.

The ideal candidate will be an experienced Site Reliability Engineer with a deep background in AWS, Kubernetes (EKS), Terraform, and monitoring/eventing tools. You'll have a strong grasp of application-level troubleshooting, chaos engineering, and performance tuning.

This is a fantastic opportunity to work in a modern DevOps environment where innovation is encouraged, personal development is supported, and technical impact is real.

The Role:
* Manage and optimise AWS and Kubernetes (EKS) infrastructure
* Implement resilience strategies and conduct chaos engineering experiments
* Monitor and maintain Kafka clusters for performance and reliability
* Respond to and resolve application-level production incidents

The Person:
* 5+ years in SRE, DevOps, or infrastructure engineering
* Strong experience with AWS, EKS/Kubernetes, and Terraform
* Familiar with Kafka and observability tools like Datadog or Grafana
* Able to troubleshoot issues across infrastructure and application layers

Reference number: BBBH(phone number removed)

To apply for this role or to be considered for further roles, please click "Apply Now" or contact Tommy Williams at Rise Technical Recruitment.

Rise Technical Recruitment Ltd acts as an employment agency for permanent roles and an employment business for temporary roles.

The salary advertised is the bracket available for this position. The actual salary paid will be dependent on your level of experience, qualifications and skill set. We are an equal opportunities employer and welcome applications from all suitable candidates.

Site Reliability Engineer

London, London Cisco

Posted 2 days ago

Job Description

Join us on the Splunk TechOps team, empowering our customers to execute our vision of making machine data accessible, usable, and valuable to everyone! The Splunk TechOps organization runs Splunk Cloud, blending SRE, Systems Engineering and Service Engineering disciplines, across functional global teams.
Come join a team that is striving for operational awesomeness and trying to automate the world. We have a large presence with large cloud vendors. You should have experience with architecture, deployments, and networking in one or more of the major industry vendors. This is an incredible opportunity to use your existing cloud experience and drive the growth of Splunk Cloud.
**What we're looking for**
**NOTE:** **4 x 10h shifts: Wednesday - Saturday/8am-6pm**
We are looking for a TechOps SRE to help maintain, contribute to and improve the next generation of our large scale Cloud offering. You will be working with providers and supporting the infrastructure that powers Splunk's cloud offering.
**You should apply if**
+ **You are comfortable working 4 x 10h shifts: Wednesday - Saturday/8am-6pm**
+ You have operational experience at scale. You have had hands-on roles that deal with operating systems (particularly Linux) and networking. You might also have worked with Cloud technologies. Your previous job titles might be something close to systems admin, network engineer or devops engineer.
+ You're passionate about your work. Our customers are passionate about Splunk and we want the same from our engineers. You should enjoy actively being responsible for your work and be excited about your projects.
+ You love large complex systems. You have experience working on distributed systems, or a passion for finding edge cases that appear at scale. You are interested in how to take a small one-off task and implement it across several thousand machines at once.
+ You have some development skills. We have code in several languages, ranging from Python and Shell to Go and C++. We don't expect you to be a software engineer but you should be familiar with basic programming and understand concepts like input sanitisation and unit testing.
+ "How can I automate this process?" is a question you constantly ask yourself.
+ Data drives your decisions. Data excites you and you make decisions based on numbers rather than assumptions. If an issue arises, you strive to be alerted before our customers notice.
+ You care about monitoring. Shipping code often and getting useful feedback excites you and you're not worried about changing direction when a solution isn't working as expected.
**What we provide**
+ Opportunities to develop and grow as an engineer. We are always expanding into new areas, working with open-source projects and contributing back, and exploring new technologies.
+ A team of incredibly capable and dedicated peers, all the way from engineering to product management and customer support.
+ Breadth and depth. You are interested in working in an area that dynamically scales to meet the needs of Splunk's cloud offering. You want to go deep into optimizing how we automate every manual process and tedious task we encounter.
+ Growth and mentorship. We believe in growing engineers through ownership and leadership opportunities. We also believe that mentors help both sides of the equation.
+ A stable, collaborative, and supportive work environment. Honesty and collaboration are values we see as a core part of our team identity. We understand the value of open communication: working together to get things done, and adapting to the changing needs of the team and individuals. This is reflected both in our internal communications and in how we interact with our customers.
+ Balance. We don't expect people to work 12 hour days. We want you to be successful outside of work too. We trust our colleagues to be responsible with their time and commitment, and believe that balance helps cultivate a positive environment.
Splunk, a Cisco company, is an Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis.

Site Reliability Engineer

London, London Gizmo

Posted 5 days ago

Job Description

Permanent

Gizmo is an AI startup on a mission to make learning so easy that anyone can learn anything. We're building Duolingo for anything - a platform that uses gamification and social mechanics to make learning fun.  

With over 1 million monthly active users and $4M in annual recurring revenue, we’re already one of the fastest-growing startups in the UK. Backed by leading investors, we recently raised $22M in Series A funding to accelerate our vision of helping 1 billion people learn.

Role Overview
Reporting to the CTO, you will own capacity, performance and reliability for Gizmo’s full-stack platform as daily traffic climbs from hundreds of thousands to millions of users. You’ll write code across the stack, but your charter is classic SRE: defend SLOs, eliminate toil, and raise the ceiling on scale before it becomes a hard limit.

Key Responsibilities

  • Define SLIs/SLOs for latency, availability and error rate; codify error budgets and partner with product teams on trade-offs.
  • Perform load-testing, capacity modelling and up-front scalability design for PostgreSQL, OpenSearch, Redis, Hasura and CF Workers; produce data-driven scaling plans.
  • Extend metrics, structured logging and tracing; establish alert rules that page only on user-visible impact; build actionable runbooks.
  • Join the on-call rotation, lead blameless post-mortems, drive remediation work to closure and track MTTR/MTBF improvements.
  • Automate repetitive ops on Kubernetes and CI/CD; keep “toil” <50% of your time by pushing fixes into code.
  • Coach full-stack engineers on query optimisation, schema design and back-pressure techniques; document patterns and anti-patterns by creating an SRE playbook.

Requirements

  • Hands-on scale experience: you have run relational stores at 100k+ TPS or 1M+ concurrent users (e.g., multi-tenant PostgreSQL, sharded MySQL).
  • Strong backend fundamentals around concurrency, caching, indexing and distributed systems trade-offs.
  • Proven track record of setting SLOs, building dashboards (Prometheus/Grafana, OpenTelemetry, etc.) and tuning alerts.
  • Comfort with Kubernetes, IaC and cloud-native patterns; can debug from network to application layer.
  • Start-up bias for action: you prioritise high-leverage fixes, ship iteratively and own outcomes end-to-end.
  • Collaborative and feedback-driven; you welcome post-mortem culture and continuous improvement.
  • Driven by impact - you prioritise work that moves the needle!

Nice-to-haves: experience with Hasura internals, Cloudflare Workers edge optimisation, or operating OpenSearch at scale.

Benefits

  • Highly competitive salary.
  • You'll own a piece of what you're building - equity included.
  • Hybrid working model with 4 days in our East London office, ideally located between Shoreditch High Street, Old Street, and Liverpool Street stations.
  • The opportunity to become one of the earliest employees in one of the UK’s fastest-growing startups.
  • Private health insurance

Site Reliability Engineer

£75000 - £90000 per annum orbit

Posted 592 days ago

Job Description

Permanent

Senior Site Reliability Engineer

TW16 Sunbury, South East BP Energy

Posted today

Job Description

Entity:

Technology


Job Family Group:

IT&S Group


Job Description:

About bp

bp is a global energy business with a purpose to reimagine energy for people and our planet. We aim to be a very different kind of energy company by 2030, helping the world reach net zero and improving people’s lives. We are committed to creating a diverse and inclusive environment where everyone can thrive. Join bp and become part of the team building our future!

You will work with

You will work as a member of a high-energy, top-performing team of engineers, working alongside technology leaders to shape the vision and drive the execution of ground-breaking compute and data platforms that make a real impact.

Let me tell you about the role

As a senior site reliability engineer, you will be responsible for building, maintaining and operating the software solutions, infrastructure and services that power technology platforms. In this role, you will work with a team of engineers and team members to ensure that the digital solutions are highly available, scalable, and secure, and you will be responsible for automating routine tasks, improving the solutions' performance, and providing technical support to other teams.

What you will deliver

  • Ensure the reliability, performance, and scalability of large-scale, cloud-based applications and infrastructure.
  • Create automated solutions to improve operational aspects of the site.
  • Ensure that applications and websites run smoothly and efficiently.
  • Detect issues and automatically manage failures to keep systems up and running.
  • Work with software developers, engineers, and operations teams to improve system performance.
  • Analyse incidents to prevent future disruptions.
  • Develop and maintain standardised solutions that can be reused across multiple teams and projects.

What you will need to be successful (experience and qualifications)

Technical skills we need from you

  • A bachelor's degree in computer science, engineering, or a related field or equivalent work experience.
  • Relevant certifications (e.g., AWS / Azure cloud engineering, fundamentals, DevOps, architect certifications) can be beneficial. Knowledge of networking concepts, protocols, and tools, and a willingness to learn new technologies and adapt to changing environments.
  • Skilled in managing configuration, deployments, and observability; handling and resolving incidents, including root cause analysis; and managing and operating complex systems for scalability, availability and performance.
  • Strong communication and collaboration skills to work effectively with development and operations teams.

Software Skills

  • Proficient in C# and TypeScript; comfortable working across platforms.
  • Experience managing large monorepos and build systems like Bazel.
  • Skilled in writing secure, stable, testable, and maintainable code.
  • Familiar with systems design principles.
  • 2+ years of software development experience, ideally in platform or service engineering.
  • Familiar with software engineering best practices across the full SDLC: coding standards, code reviews, source control, CI/CD, testing, and operations.
  • Experience supporting and operating production systems, with exposure to monitoring, logging, alerting, and basic security practices.
  • Experience designing and contributing to Internal Developer Platforms (IDPs) to streamline developer workflows and self-service capabilities.

Infrastructure Skills

  • Strong knowledge of Linux/Unix systems, including system configuration, networking, and debugging.
  • Expert in building and scaling infrastructure services using Amazon Web Services or Microsoft Azure.
  • Skilled with infrastructure tools like Kubernetes, Istio, EKS, and Kafka. Experience with Terraform, Ansible, Puppet, or Chef for infrastructure as code, monitoring tools (e.g., Prometheus, Grafana), and logging systems (e.g., ELK stack).
  • Solid understanding of core cloud application infrastructure services, including identity platforms, networking, storage, databases, containers, and serverless.
  • Strong knowledge of databases, such as relational, graph, document, and key-value, including performance tuning and improvement.

We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform crucial job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.


Travel Requirement

No travel is expected with this role


Relocation Assistance:

This role is not eligible for relocation


Remote Type:

This position is a hybrid of office/remote working


Skills:

Agility core practices, Analytics, API and platform design, Business Analysis, Cloud Platforms, Coaching, Communication, Configuration management and release, Continuous deployment and release, Data Structures and Algorithms (Inactive), Digital Project Management, Documentation and knowledge sharing, Facilitation, Information Security, iOS and Android development, Mentoring, Metrics definition and instrumentation, NoSql data modelling, Relational Data Modelling, Risk Management, Scripting, Service operations and resiliency, Software Design and Development, Source control and code management {+ 4 more}


Legal Disclaimer:

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, socioeconomic status, neurodiversity/neurocognitive functioning, veteran status or disability status. Individuals with an accessibility need may request an adjustment/accommodation related to bp’s recruiting process (e.g., accessing the job application, completing required assessments, participating in telephone screenings or interviews, etc.). If you would like to request an adjustment/accommodation related to the recruitment process, please contact us.

If you are selected for a position and depending upon your role, your employment may be contingent upon adherence to local policy. This may include pre-placement drug screening, medical review of physical fitness for the role, and background checks.

Site Reliability Engineer - Reigate

RH2 0SG Reigate, South East esure Group

Posted today

Job Description

Site Reliability Engineer
Job Tenure: Full-time, permanent
Salary: Competitive

Company Description

Ready to join a team that's leading the way in reshaping the future of insurance? Here at esure Group, we are on a mission to revolutionise insurance for good!

We’ve been providing Home and Motor Insurance since 2000, with over 2 million customers trusting us to keep them covered through our esure and Sheilas’ Wheels brands. With a bold dedication to digital innovation, we're transforming the way the industry operates and putting customers at the heart of everything we do.

Having completed our recent multi-year digital transformation, we’re now leveraging advanced technology and data-driven insights alongside exceptional service to deliver personalised experiences that meet our customers’ ever-changing needs today and in the future.

Job Description

We are currently recruiting for a Site Reliability Engineer to join our Tech Enablement function.

The successful candidate will be responsible for our monitoring estate, including its continuous improvement and maintenance, and will assist in incident investigation and resolution when required. They will also share skills within our Tech Enablement team and should be an evangelist for SRE techniques and goals to the broader IT community.

What you’ll do:

  • Deliver proactive and reactive activities to meet SLAs and availability.
  • Partner with development squads pre-launch to embed monitoring best practices.
  • Support application infrastructure to reduce risks, inefficiencies, and service issues.
  • Provide incident support during office hours and on-call when required.
  • Build strong relationships with Agile squads, DevOps, and wider technology teams.
  • Maintain and update monitoring platforms to ensure reliable, consistent operations.
  • Collaborate with teams to enhance monitoring and improve alerting capabilities.
  • Identify monitoring gaps and propose solutions to strengthen performance and resilience.

Qualifications

What we’d love you to bring:

  • Experience of AWS (particularly EC2, EKS, Lambda, S3, IAM, etc)
  • Monitoring / alerting tools (for example we use Grafana, Prometheus, Loki, CloudWatch and Dynatrace)
  • Knowledge of monitoring best practices for a variety of different platforms and technologies
  • Docker and Kubernetes
  • Git/Gitlab
  • Jenkins / CI/CD /ArgoCD
  • Jira and Confluence
  • Scripting or coding with shell/bash/python
  • Terraform
  • Able to assist in troubleshooting complex issues involving multiple platforms and technologies
  • Using agile principles and ways of working
  • The ability to manage and track multiple workstreams simultaneously

Additional Information

What’s in it for you?:

  • Competitive salary that reflects your skills, experience and potential.
  • Discretionary bonus scheme that recognises your hard work and contributions to esure’s success.
  • 25 days annual leave, plus 8 flexible days and the ability to buy and sell further holiday.
  • Our flexible benefits platform is loaded with perks to choose from, so you can build a personal toolkit to support your health, wellbeing, lifestyle, and finances.
  • Company funded private medical insurance for qualifying colleagues.
  • Fantastic discounts on our insurance products! 50% off for yourself and spouse/partner and 10% off for direct family members.
  • We’ll elevate your career with hands-on training, mentoring, access to our exclusive academies, regular career conversations, and expert partner resources.
  • Driving good in the world couldn’t be more important to us. Our colleagues can use 2 volunteering days per year to support their local communities.
  • Join our internal networks and communities to connect, learn, and share ideas with likeminded colleagues.
  • We’re a proud supporter of the ABI’s ‘Make Flexible Work’ campaign and welcome you to ask about the flexibility you need. Our hybrid working approach also puts you in the driving seat of how and where you do your best work.
  • And much more; see the full overview of our benefits on the "Reward and benefits" page of the Esure Group PLC website.

We are committed to creating an inclusive and diverse workplace where everyone feels valued, respected, and empowered. We celebrate individuality and create spaces where unique backgrounds and experiences can come together. We believe that diverse perspectives drive innovation, in turn enabling us to better serve our customers, community and build a stronger organisation. Our commitment to inclusion extends to every part of our business, from hiring practices to professional growth opportunities, ensuring equal access and support for all.

Site Reliability Engineer II

London, London American Express Global Business Travel

Posted 2 days ago

Job Description

Amex GBT is a place where colleagues find inspiration in travel as a force for good and - through their work - can make an impact on our industry. We're here to help our colleagues achieve success and offer an inclusive and collaborative culture where your voice is valued.
**What You'll Do on a Typical Day:**
+ Design and implement next-generation, highly scalable, and reliable applications using SaaS technology.
+ Translate functional specifications into logical, component-based technical designs.
+ Own delivery of application features end to end by working with internal and external teams.
+ Innovate and implement new ideas to solve complex software problems.
+ Work closely with geographically distributed team members.
**What We're Looking For:**
+ Amex GBT Egencia's Technology organisation is looking for a highly motivated, self-driven self-starter with high growth potential to be part of a growing team of technologists. You are well-versed in SDLC and Agile methodologies.
+ You have at least 1-3 years of experience in software development and troubleshooting.
+ An independent thinker, who works around problems and who isn't shy of trying new technologies. You have validated experience working in parallel technologies apart from your core technology area (Java).
+ Prior experience in working harmoniously with a cross-geography team will be an added advantage. You should be equally comfortable in development, test, and debugging roles and be ready to wear many hats. This team values "fail-fast" learners and technology enthusiasts who view learning new technology as a fun experience.
+ Strong knowledge of Object Oriented Programming, Data Structures, and Algorithms
+ Good proficiency in any of the programming languages from Java, Golang, Python, or Bash
+ Proven ability to develop and support large-sized highly scalable software systems
+ Experience in AWS Services
+ Good knowledge of container orchestration frameworks primarily Kubernetes
+ Basic understanding of logging and monitoring frameworks
+ Knowledge of cloud computing concepts along with an understanding of application communication and routing is a plus
+ Good experience in developing and deploying AWS cloud-based platforms
+ Good understanding of network topologies with experience in hybrid cloud architecture will be a plus
+ Experience with the Agile Tool set and Programming Practices
+ Knowledge of CI-CD principles
+ Knowledge of server-side design patterns is a plus
+ Ability to quickly pick up new technologies, and languages with ease
+ A standout colleague who collaborates and incorporates feedback from all partners
+ Excellent written and verbal communication skills
+ BS or MS in Computer Science or equivalent degree
#GBTJobs
**Location**
London, United Kingdom
**The #TeamGBT Experience**
Work and life: Find your happy medium at Amex GBT.
+ **Flexible benefits** are tailored to each country and start the day you do. These include health and welfare insurance plans, retirement programs, parental leave, adoption assistance, and wellbeing resources to support you and your immediate family.
+ **Travel perks:** get a choice of deals each week from major travel providers on everything from flights to hotels to cruises and car rentals.
+ **Develop the skills you want** when the time is right for you, with access to over 20,000 courses on our learning platform, leadership courses, and new job openings available to internal candidates first.
+ **We strive to champion Inclusion** in every aspect of our business at Amex GBT. You can connect with colleagues through our global INclusion Groups, centered around common identities or initiatives, to discuss challenges, obstacles, achievements, and drive company awareness and action.
+ And much more!
All applicants will receive equal consideration for employment without regard to age, sex, gender (and characteristics related to sex and gender), pregnancy (and related medical conditions), race, color, citizenship, religion, disability, or any other class or characteristic protected by law.
Furthermore, we are committed to providing reasonable accommodation to qualified individuals with disabilities. Please let your recruiter know if you need an accommodation at any point during the hiring process. For details regarding how we protect your data, please consult the Amex GBT Recruitment Privacy Statement.
**What if I don't meet every requirement?** If you're passionate about our mission and believe you'd be a phenomenal addition to our team, don't worry about "checking every box;" please apply anyway. You may be exactly the person we're looking for!

Principal Site Reliability Engineer

London, London Orgvue

Posted 27 days ago

Job Description

Permanent

Orgvue is a leading organizational design and planning software platform that captures the power of data visualization and modelling to build more adaptable, and better performing organizations. HR, finance and business leaders use Orgvue for actionable insight and analysis that helps them make faster workforce decisions in a constantly changing world.

Orgvue is used by the world’s largest and best-known enterprises and management consulting firms to visualize and confidently build the businesses they want tomorrow, today. The company is headquartered in London, with offices in Philadelphia, The Hague, Toronto, and Sydney.

We are seeking a Principal Site Reliability Engineer who will be a senior technical leader focused on scaling and hardening our AWS- and Kubernetes-based infrastructure.

Role

In this role you will work across product, platform, and operations teams to ensure our systems are reliable, observable, and resilient, even at scale.

This role combines hands-on technical capability with strategic vision, helping us build a world-class reliability culture and a robust engineering foundation for growth. We're looking for someone who has technical expertise, is a great communicator and enjoys collaborating across multiple teams.

Responsibilities

  • Define and enforce SLOs, SLIs, and error budgets across critical services
  • Craft and implement a cloud infrastructure and tooling strategy
  • Work across our Org to level up SRE practices
  • Help implement robust observability (metrics, logs, and traces) using our observability tool
  • Guide the team in building automated, self-healing systems
  • Own and evolve our incident response processes, including on-call practices and post-mortem culture
  • Mentor engineers across the org on best practices in reliability, operational readiness, and scalable infrastructure
  • Drive Infrastructure as Code (IaC) using Terraform, Kubernetes, CloudFormation and GitOps practices
  • Collaborate closely with security, DevOps, and software teams to ensure compliance, scalability, and operational excellence
  • Evaluate and introduce tools, patterns, and practices that improve the performance and reliability of our SaaS platform

Requirements

  • Demonstrable experience leading SRE transformations
  • Deep hands-on expertise with Kubernetes (EKS preferred) in production environments
  • Strong experience with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.)
  • Expert in Infrastructure as Code using tools such as Terraform, with knowledge of GitOps workflows
  • Strong background in observability: metrics, visualization, logging, and tracing
  • Understanding of automation, SDLC, CI/CD pipelines, deployment automation, and blue/green or canary releases
  • Proven experience with incident management, disaster recovery planning, root cause analysis, and post-incident reviews

Benefits

  • Hybrid working - 1+ days a week in the London office
  • Wellbeing: Sanctus Coaching, Virtual fitness sessions, Wellbeing webinars, Annual Wellbeing day
  • Subsidised Gym Membership
  • Private Medical Insurance (including Dental and Vision) and Life Assurance
  • 25 days holiday (increasing to 30 days at a rate of 1 extra day per year)
  • Summer Fridays (half-day Fridays for the months of July and August)
  • Employer pension contribution of 5% of your gross salary, if you contribute a minimum of 3%
  • Season ticket Loan
  • Cycle to Work Scheme
  • Annual Discretionary Bonus

'Here at Orgvue we promote individualism and a diverse workforce to build on our future success'

Site Reliability Engineer - Remote

London, London ESL FACEIT Group

Posted 278 days ago

Job Description

Permanent

At EFG (ESL FACEIT Group) we create worlds beyond gameplay where players and fans become community. We pride ourselves in having a corporate social responsibility which is that “IT’S NOT GG (Good Game), UNTIL IT’S GG FOR ALL”. We are passionate about the culture we foster that ultimately helps to create and shape the world of esports, gaming tournaments, leagues, events and holistic ecosystems staged for our millions of players, fans and heroes.

The Team:

As a Site Reliability Engineer at EFG, you will be designing, analyzing, and troubleshooting large-scale distributed systems. You will demonstrate a systematic problem-solving approach, and the ability to debug and optimize code and to automate routine tasks. You will ensure that EFG’s services and systems are reliable, that they have uptime appropriate to users' needs and they have a fast rate of improvement. 

Apart from monitoring our systems' capacity and performance, you will also focus on optimizing existing systems, on building infrastructure and on eliminating work through automation.  You will work collaboratively with the software engineering teams to deploy and operate our systems, and you will help to automate and streamline our operations and processes. Within this role, you will be given real responsibilities, and you have the opportunity to drive change and have a big impact on our products and platform.

What you will do:

  • Maintaining and improving the monitoring and observability tools (Grafana/Prometheus/Thanos/Jaeger);
  • Working closely with your team and with other cross-functional teams to help design, maintain and operate systems at scale;
  • Developing and driving adoption of SRE best practices across the company;
  • Leading on incident management process and adoption;
  • Using your troubleshooting skills to help identify and fix operational issues;
  • Working with Cloud Native technologies such as Kubernetes, Envoy, Istio, Prometheus and Helm;
  • Working with the “Hashi Stack” (Terraform, Packer, Vault);
  • Experimenting with and introducing cutting edge technologies.

Requirements

  • Proven experience as a Site Reliability Engineer, DevXP Engineer or Software Engineer, focusing on building and maintaining scalable infrastructures;
  • Excellent working knowledge of at least one of the major cloud providers (GCP/AWS/Azure);
  • You have experience with cluster management systems (Kubernetes);
  • Knowledge of incident management: ability to investigate, troubleshoot, recover and prevent the recurrence of incidents that interfere with the normal delivery of IT services;
  • Proficient in Go, with some level of proficiency in at least one other language: Java, Python, Rust…;
  • You have knowledge of GitOps practices;
  • You have production-scale experience with one of the following: MongoDB, Redis, MySQL;
  • Experience contributing to open source technologies would be an added bonus.

Senior Site Reliability Engineer - Reigate

RH2 0SG Reigate, South East esure Group

Posted today

Job Description

Senior Site Reliability Engineer

Job Tenure: Full-time, permanent
Salary: Competitive

Company Description

Ready to join a team that's leading the way in reshaping the future of insurance? Here at esure Group, we are on a mission to revolutionise insurance for good!

We’ve been providing Home and Motor Insurance since 2000, with over 2 million customers trusting us to keep them covered through our esure and Sheilas’ Wheels brands. With a bold dedication to digital innovation, we're transforming the way the industry operates and putting customers at the heart of everything we do.

Having completed our recent multi-year digital transformation, we’re now leveraging advanced technology and data-driven insights alongside exceptional service to deliver personalised experiences that meet our customers’ ever-changing needs today and in the future.

Job Description

We are currently recruiting for a Senior Site Reliability Engineer to join our Tech Enable team.

As a Lead Engineer for Site Reliability, you must demonstrate various skills to effectively lead and engage in SRE practices. The successful candidate will act as a point of escalation for critical issues, applying technical expertise to promptly address complex problems in collaboration with additional teams.

What you’ll do:

  • Serve as the SRE Lead's backup, assuming leadership duties when necessary to maintain the continuity and efficiency of SRE operations.
  • Provide day-to-day guidance, support, and informed decision-making for the team, maintaining stability and direction.
  • Serve as a subject matter expert, shaping technical direction, leading initiatives, and mentoring colleagues to build team capability.
  • Stay up to date with emerging technologies and industry trends, sharing knowledge across company communities to embed SRE best practice.
  • Drive continual improvement by automating manual processes and optimising monitoring systems to achieve full estate coverage.
  • Lead initiatives to improve availability, performance, and scalability through proactive monitoring, capacity planning, and ongoing maintenance.
  • Collaborate with development squads to embed monitoring, reliability, and scalability best practices within the development lifecycle.
  • Represent the SRE team in stakeholder engagements, providing progress updates, managing expectations, and addressing concerns.
  • Operate as a primary contact for pressing issues, employing technical skills to solve complex problems rapidly in coordination with other teams.
  • Participate in out-of-hours on-call or standby duties when required.

Qualifications

What we’d love you to bring:

  • Deep experience of AWS (particularly EC2, EKS, Lambda, S3, IAM, etc.)
  • Monitoring / alerting tools (for example we use Grafana, Prometheus, Loki, CloudWatch and Dynatrace)
  • SME on monitoring best practices for a variety of different platforms and technologies
  • Docker and Kubernetes
  • Git/GitLab
  • Jenkins / CI/CD/ArgoCD
  • Jira and Confluence
  • Scripting or coding with shell/bash/python
  • Terraform
  • Able to assist in troubleshooting complex issues involving multiple platforms and technologies
  • Using agile principles and ways of working
  • Familiarity with best practice for cloud hosted architectures and solutions
  • The ability to manage and track multiple workstreams simultaneously

Additional Information

What’s in it for you?

  • Competitive salary that reflects your skills, experience and potential.
  • Discretionary bonus scheme that recognises your hard work and contributions to esure’s success.
  • 25 days annual leave, plus 8 flexible days and the ability to buy and sell further holiday.
  • Our flexible benefits platform is loaded with perks to choose from, so you can build a personal toolkit to support your health, wellbeing, lifestyle, and finances.
  • Company funded private medical insurance for qualifying colleagues.
  • Fantastic discounts on our insurance products! 50% off for yourself and spouse/partner and 10% off for direct family members.
  • We’ll elevate your career with hands-on training, mentoring, access to our exclusive academies, regular career conversations, and expert partner resources.
  • Driving good in the world couldn’t be more important to us. Our colleagues can use 2 volunteering days per year to support their local communities.
  • Join our internal networks and communities to connect, learn, and share ideas with likeminded colleagues.
  • We’re a proud supporter of the ABI’s ‘Make Flexible Work’ campaign and welcome you to ask about the flexibility you need. Our hybrid working approach also puts you in the driving seat of how and where you do your best work.
  • And much more. See a full overview of our benefits here: Reward and benefits | Esure Group PLC

We are committed to creating an inclusive and diverse workplace where everyone feels valued, respected, and empowered. We celebrate individuality and create spaces where unique backgrounds and experiences can come together. We believe that diverse perspectives drive innovation, in turn enabling us to better serve our customers, community and build a stronger organisation. Our commitment to inclusion extends to every part of our business, from hiring practices to professional growth opportunities, ensuring equal access and support for all.
