104 Devops Engineers jobs in London
Site Reliability Engineer
Posted today
Job Description
Senior Site Reliability Engineer
London - Hybrid
80,000 - 90,000 + 38 Days Holiday + Private Healthcare + Life Assurance + Flexible Working + Pension
Excellent opportunity for a Site Reliability Engineer to join a forward-thinking and high-growth technology company offering a hybrid work environment, great benefits, and opportunities for further progression!
This company operates at the forefront of digital transformation, delivering a unified platform built for scalability, resilience, and performance. With a strong culture rooted in integrity, creativity, and technical excellence, they've become a trusted partner across global industries.
In this role you'll take ownership of platform reliability, resilience engineering, and incident management across cutting-edge cloud infrastructure. You'll play a key role in ensuring uptime, performance, and continuous improvement of core systems.
The ideal candidate will be an experienced Site Reliability Engineer with a deep background in AWS, Kubernetes (EKS), Terraform, and monitoring/eventing tools. You'll have a strong grasp of application-level troubleshooting, chaos engineering, and performance tuning.
This is a fantastic opportunity to work in a modern DevOps environment where innovation is encouraged, personal development is supported, and technical impact is real.
The Role:
* Manage and optimise AWS and Kubernetes (EKS) infrastructure
* Implement resilience strategies and conduct chaos engineering experiments
* Monitor and maintain Kafka clusters for performance and reliability
* Respond to and resolve application-level production incidents
The Person:
* 5+ years in SRE, DevOps, or infrastructure engineering
* Strong experience with AWS, EKS/Kubernetes, and Terraform
* Familiar with Kafka and observability tools like Datadog or Grafana
* Able to troubleshoot issues across infrastructure and application layers
Reference number: BBBH(phone number removed)
To apply for this role or to be considered for further roles, please click "Apply Now" or contact Tommy Williams at Rise Technical Recruitment.
Rise Technical Recruitment Ltd acts as an employment agency for permanent roles and an employment business for temporary roles.
The salary advertised is the bracket available for this position. The actual salary paid will be dependent on your level of experience, qualifications and skill set. We are an equal opportunities employer and welcome applications from all suitable candidates.
Site Reliability Engineer

Posted 2 days ago
Job Description
Come join a team that is striving for operational awesomeness and trying to automate the world. We have a large presence with large cloud vendors. You should have experience with architecture, deployments, and networking in one or more of the major industry vendors. This is an incredible opportunity to use your existing cloud experience and drive the growth of Splunk Cloud.
**What we're looking for**
**NOTE:** **4 x 10h shifts: Wednesday - Saturday/8am-6pm**
We are looking for a TechOps SRE to help maintain, contribute to and improve the next generation of our large scale Cloud offering. You will be working with providers and supporting the infrastructure that powers Splunk's cloud offering.
**You should apply if**
+ **you are comfortable working 4 x 10h shifts: Wednesday - Saturday/8am-6pm**
+ You have operational experience at scale. You have had hands-on roles that deal with operating systems (particularly Linux) and networking. You might also have worked with Cloud technologies. Your previous job titles might be something close to systems admin, network engineer or devops engineer.
+ You're passionate about your work. Our customers are passionate about Splunk and we want the same from our engineers. You should enjoy actively being responsible for your work and be excited about your projects.
+ You love large complex systems. You have experience working on distributed systems or a passion for finding edge cases that appear at scale. You are interested in how to take a small one-off task and implement it across several thousand machines at once.
+ You have some development skills. We have code in several languages, ranging from Python and Shell to Go and C++. We don't expect you to be a software engineer but you should be familiar with basic programming and understand concepts like input sanitisation and unit testing.
+ "How can I automate this process?" is a question you constantly ask yourself.
+ Data drives your decisions. Data excites you and you make decisions based on numbers rather than assumptions. If an issue arises, you strive to be alerted before our customers notice.
+ You care about monitoring. Shipping code often and getting useful feedback excites you and you're not worried about changing direction when a solution isn't working as expected.
**What we provide**
+ Opportunities to develop and grow as an engineer. We are always expanding into new areas, working with open-source projects and contributing back, and exploring new technologies.
+ A team of incredibly capable and dedicated peers, all the way from engineering to product management and customer support.
+ Breadth and depth. You are interested in working in an area that dynamically scales to meet the needs of Splunk's cloud offering. You want to go deep into optimizing how we automate every manual process and tedious task we encounter.
+ Growth and mentorship. We believe in growing engineers through ownership and leadership opportunities. We also believe that mentors help both sides of the equation.
+ A stable, collaborative, and supportive work environment. Honesty and collaboration are values we see as a core part of our team identity. We understand the value in open communication: working together to get things done, and adapting to the changing needs of the team and individuals. This is reflected in both our internal communications and also in how we interact with our customers.
+ Balance. We don't expect people to work 12 hour days. We want you to be successful outside of work too. We trust our colleagues to be responsible with their time and commitment, and believe that balance helps cultivate a positive environment.
Splunk, a Cisco company, is an Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis.
Site Reliability Engineer
Posted 5 days ago
Job Description
Gizmo is an AI startup on a mission to make learning so easy that anyone can learn anything. We're building Duolingo for anything - a platform that uses gamification and social mechanics to make learning fun.
With over 1 million monthly active users and $4M in annual recurring revenue, we’re already one of the fastest-growing startups in the UK. Backed by leading investors, we recently raised $22M in Series A funding to accelerate our vision of helping 1 billion people learn.
Role Overview
Reporting to the CTO, you will own capacity, performance and reliability for Gizmo’s full-stack platform as daily traffic climbs from hundreds of thousands to millions of users. You’ll write code across the stack, but your charter is classic SRE: defend SLOs, eliminate toil, and raise the ceiling on scale before it becomes a hard limit.
Key Responsibilities
- Define SLIs/SLOs for latency, availability and error rate; codify error budgets and partner with product teams on trade-offs.
- Perform load-testing, capacity modelling and up-front scalability design for PostgreSQL, OpenSearch, Redis, Hasura and CF Workers; produce data-driven scaling plans.
- Extend metrics, structured logging and tracing; establish alert rules that page only on user-visible impact; build actionable runbooks.
- Join the on-call rotation, lead blameless post-mortems, drive remediation work to closure and track MTTR/MTBF improvements.
- Automate repetitive ops on Kubernetes and CI/CD; keep “toil” below 50% of your time by pushing fixes into code.
- Coach full-stack engineers on query optimisation, schema design and back-pressure techniques; document patterns and anti-patterns by creating an SRE playbook.
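To illustrate the error-budget idea named in the responsibilities above, here is a minimal Python sketch. The class name, its fields, and the 99.9% target are assumptions chosen for illustration; they are not Gizmo's actual tooling or targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical availability error budget over one measurement window."""

    slo_target: float      # assumed SLO, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLI in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        allowed = self.allowed_failures
        if allowed == 0:
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1.0 - self.failed_requests / allowed

    @property
    def exhausted(self) -> bool:
        return self.remaining_fraction <= 0.0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget remaining: {budget.remaining_fraction:.1%}")
```

A team might treat a nearly exhausted budget as a signal to pause risky releases and prioritise reliability work, which is the kind of trade-off the role would discuss with product teams.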
Requirements
- Hands-on scale experience: you have run relational stores at 100k+ TPS or 1M+ concurrent users (e.g., multi-tenant PostgreSQL, sharded MySQL).
- Strong backend fundamentals around concurrency, caching, indexing and distributed systems trade-offs.
- Proven track record of setting SLOs, building dashboards (Prometheus/Grafana, OpenTelemetry, etc.) and tuning alerts.
- Comfort with Kubernetes, IaC and cloud-native patterns; can debug from network to application layer.
- Start-up bias for action: you prioritise high-leverage fixes, ship iteratively and own outcomes end-to-end.
- Collaborative and feedback-driven; you welcome post-mortem culture and continuous improvement.
- Driven by impact - you prioritise work that moves the needle!
Nice-to-haves: experience with Hasura internals, Cloudflare Workers edge optimisation, or operating OpenSearch at scale.
Benefits
- Highly competitive salary.
- You'll own a piece of what you're building - equity included.
- Hybrid working model with 4 days in our East London office, ideally located between Shoreditch High Street, Old Street, and Liverpool Street stations.
- The opportunity to become one of the earliest employees in one of the UK’s fastest-growing startups.
- Private health insurance
Site Reliability Engineer
Posted 592 days ago
Senior Site Reliability Engineer
Posted today
Job Description
Entity: Technology
bp is a global energy business with a purpose to reimagine energy for people and our planet. We aim to be a very different kind of energy company by 2030, helping the world reach net zero and improving people’s lives. We are committed to creating a diverse and inclusive environment where everyone can thrive. Join bp and become part of the team building our future!
You will work with
You will work as a member of a high-energy, top-performing team of engineers, working alongside technology leaders to shape the vision and drive the execution of ground-breaking compute and data platforms that make a real impact.
Let me tell you about the role
As a senior site reliability engineer, you will be responsible for building, maintaining and operating the software solutions, infrastructure and services that power technology platforms. In this role, you will work with a team of engineers and team members to ensure that the digital solutions are highly available, scalable, and secure, and you will be responsible for automating routine tasks, improving the solution's performance, and providing technical support to other teams.
What you will deliver
- Ensure the reliability, performance, and scalability of large-scale, cloud-based applications and infrastructure.
- Create automated solutions to improve operational aspects of the site.
- Ensure that applications and websites run smoothly and efficiently.
- Detect issues and automatically manage failures to keep systems up and running.
- Work with software developers, engineers, and operations teams to improve system performance.
- Analyse incidents to prevent future disruptions.
- Develop and maintain standardised solutions that can be reused across multiple teams and projects.
What you will need to be successful (experience and qualifications)
Technical skills we need from you
- A bachelor's degree in computer science, engineering, or a related field or equivalent work experience.
- Relevant certifications (e.g., AWS / Azure cloud engineering, fundamentals, DevOps, architect certifications) can be beneficial, as can knowledge of networking concepts, protocols, and tools, and a willingness to learn new technologies and adapt to changing environments.
- Skilled in managing configuration, deployments and observability; in handling and resolving incidents, including root cause analysis; and in managing and operating complex systems for scalability, availability and performance.
- Strong communication and collaboration skills to work effectively with development and operations teams.
Software Skills
- Proficient in C# and TypeScript; comfortable working across platforms.
- Experience managing large monorepos and build systems like Bazel.
- Skilled in writing secure, stable, testable, and maintainable code.
- Familiar with systems design principles.
- 2+ years of software development experience, ideally in platform or service engineering.
- Familiar with software engineering best practices across the full SDLC: coding standards, code reviews, source control, CI/CD, testing, and operations.
- Experience supporting and operating production systems, with exposure to monitoring, logging, alerting, and basic security practices.
- Experience designing and contributing to Internal Developer Platforms (IDPs) to streamline developer workflows and self-service capabilities.
Infrastructure Skills
- Strong knowledge of Linux/Unix systems, including system configuration, networking, and debugging.
- Expert in building and scaling infrastructure services using Amazon Web Services or Microsoft Azure.
- Skilled with infrastructure tools like Kubernetes, Istio, EKS, and Kafka. Experience with Terraform, Ansible, Puppet, or Chef for infrastructure as code, monitoring tools (e.g., Prometheus, Grafana), and logging systems (e.g., ELK stack).
- Skilled in using core cloud application infrastructure services, including identity platforms, networking, storage, databases, containers, and serverless.
- Strong knowledge of databases, such as relational, graph, document, and key-value, including performance tuning and improvement.
We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform crucial job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.
Legal Disclaimer:
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, socioeconomic status, neurodiversity/neurocognitive functioning, veteran status or disability status. Individuals with an accessibility need may request an adjustment/accommodation related to bp’s recruiting process (e.g., accessing the job application, completing required assessments, participating in telephone screenings or interviews, etc.). If you would like to request an adjustment/accommodation related to the recruitment process, please contact us.
If you are selected for a position and depending upon your role, your employment may be contingent upon adherence to local policy. This may include pre-placement drug screening, medical review of physical fitness for the role, and background checks.
Site Reliability Engineer - Reigate
Posted today
Job Description
Ready to join a team that's leading the way in reshaping the future of insurance? Here at esure Group, we are on a mission to revolutionise insurance for good!
We’ve been providing Home and Motor Insurance since 2000, with over 2 million customers trusting us to keep them covered through our esure and Sheilas’ Wheels brands. With a bold dedication to digital innovation, we're transforming the way the industry operates and putting customers at the heart of everything we do.
Having completed our recent multi-year digital transformation, we’re now leveraging advanced technology and data-driven insights alongside exceptional service, to deliver personalised experiences that meet our customers’ ever-changing needs today and in the future.
We are currently recruiting for a Site Reliability Engineer to join our Tech Enablement function.
The successful candidate will be responsible for our monitoring estate and for its continuous improvement and maintenance, and will assist in incident investigation and resolution when required. They will also share skills within our Tech Enablement team and should be an evangelist for SRE techniques and goals to the broader IT community.
What you’ll do:
- Deliver proactive and reactive activities to meet SLAs and availability.
- Partner with development squads pre-launch to embed monitoring best practices.
- Support application infrastructure to reduce risks, inefficiencies, and service issues.
- Provide incident support during office hours and on-call when required.
- Build strong relationships with Agile squads, DevOps, and wider technology teams.
- Maintain and update monitoring platforms to ensure reliable, consistent operations.
- Collaborate with teams to enhance monitoring and improve alerting capabilities.
- Identify monitoring gaps and propose solutions to strengthen performance and resilience.
What we’d love you to bring:
- Experience of AWS (particularly EC2, EKS, Lambda, S3, IAM, etc)
- Monitoring / alerting tools (for example we use Grafana, Prometheus, Loki, CloudWatch and Dynatrace)
- Knowledge of monitoring best practices for a variety of different platforms and technologies
- Docker and Kubernetes
- Git/Gitlab
- Jenkins / CI/CD / ArgoCD
- Jira and Confluence
- Scripting or coding with shell/bash/python
- Terraform
- Able to assist in troubleshooting complex issues involving multiple platforms and technologies
- Using agile principles and ways of working
- The ability to manage and track multiple workstreams simultaneously
What’s in it for you?
- Competitive salary that reflects your skills, experience and potential.
- Discretionary bonus scheme that recognises your hard work and contributions to esure’s success.
- 25 days annual leave, plus 8 flexible days and the ability to buy and sell further holiday.
- Our flexible benefits platform is loaded with perks to choose from, so you can build a personal toolkit to support your health, wellbeing, lifestyle, and finances.
- Company funded private medical insurance for qualifying colleagues.
- Fantastic discounts on our insurance products! 50% off for yourself and spouse/partner and 10% off for direct family members.
- We’ll elevate your career with hands-on training, mentoring, access to our exclusive academies, regular career conversations, and expert partner resources.
- Driving good in the world couldn’t be more important to us. Our colleagues can use 2 volunteering days per year to support their local communities.
- Join our internal networks and communities to connect, learn, and share ideas with likeminded colleagues.
- We’re a proud supporter of the ABI’s ‘Make Flexible Work’ campaign and welcome you to ask about the flexibility you need. Our hybrid working approach also puts you in the driving seat of how and where you do your best work.
- And much more; see a full overview of our benefits at Reward and benefits | Esure Group PLC.
We are committed to creating an inclusive and diverse workplace where everyone feels valued, respected, and empowered. We celebrate individuality and create spaces where unique backgrounds and experiences can come together. We believe that diverse perspectives drive innovation, in turn enabling us to better serve our customers, community and build a stronger organisation. Our commitment to inclusion extends to every part of our business, from hiring practices to professional growth opportunities, ensuring equal access and support for all.
Site Reliability Engineer II

Posted 2 days ago
Job Description
**What You'll Do on a Typical Day:**
+ Design and implement next-generation, highly scalable, and reliable applications using SaaS technology.
+ Translate functional specifications into logical, component-based technical designs.
+ Own delivery of application features end to end by working with internal and external teams.
+ Innovate and implement new ideas to solve complex software problems.
+ Work closely with geographically distributed team members.
**What We're Looking For:**
+ Amex GBT Egencia's Technology organisation is looking for a highly motivated, self-driven self-starter with high growth potential to be part of a growing team of technologists. You are well-versed in SDLC and Agile methodologies.
+ You have at least 1-3 years of experience in software development and troubleshooting.
+ You are an independent thinker who works around problems and isn't shy of trying new technologies. You have validated experience working in parallel technologies apart from your core technology area (Java).
+ Prior experience in working harmoniously with a cross-geography team will be an added advantage. You should be equally comfortable in development, test, and debugging roles and be ready to wear many hats. This team values "fail-fast" learners and technology enthusiasts who view learning new technology as a fun experience.
+ Strong knowledge of Object Oriented Programming, Data Structures, and Algorithms
+ Good proficiency in any of the programming languages from Java, Golang, Python, or Bash
+ Proven ability to develop and support large-sized highly scalable software systems
+ Experience in AWS Services
+ Good knowledge of container orchestration frameworks, primarily Kubernetes
+ Basic understanding of logging and monitoring frameworks
+ Knowledge of cloud computing concepts along with an understanding of application communication and routing is a plus
+ Good experience in developing and deploying AWS cloud-based platforms
+ Good understanding of network topologies with experience in hybrid cloud architecture will be a plus
+ Experience with the Agile Tool set and Programming Practices
+ Knowledge of CI-CD principles
+ Knowledge of server-side design patterns is a plus
+ Ability to quickly pick up new technologies and languages with ease
+ A standout colleague who collaborates and incorporates feedback from all partners
+ Excellent written and verbal communication skills
+ BS or MS in Computer Science or equivalent degree
#GBTJobs
**Location**
London, United Kingdom
**The #TeamGBT Experience**
Work and life: Find your happy medium at Amex GBT.
+ **Flexible benefits** are tailored to each country and start the day you do. These include health and welfare insurance plans, retirement programs, parental leave, adoption assistance, and wellbeing resources to support you and your immediate family.
+ **Travel perks:** get a choice of deals each week from major travel providers on everything from flights to hotels to cruises and car rentals.
+ **Develop the skills you want** when the time is right for you, with access to over 20,000 courses on our learning platform, leadership courses, and new job openings available to internal candidates first.
+ **We strive to champion Inclusion** in every aspect of our business at Amex GBT. You can connect with colleagues through our global INclusion Groups, centered around common identities or initiatives, to discuss challenges, obstacles, achievements, and drive company awareness and action.
+ And much more!
All applicants will receive equal consideration for employment without regard to age, sex, gender (and characteristics related to sex and gender), pregnancy (and related medical conditions), race, color, citizenship, religion, disability, or any other class or characteristic protected by law.
Click Here for Additional Disclosures in Accordance with the LA County Fair Chance Ordinance.
Furthermore, we are committed to providing reasonable accommodation to qualified individuals with disabilities. Please let your recruiter know if you need an accommodation at any point during the hiring process. For details regarding how we protect your data, please consult the Amex GBT Recruitment Privacy Statement.
**What if I don't meet every requirement?** If you're passionate about our mission and believe you'd be a phenomenal addition to our team, don't worry about "checking every box;" please apply anyway. You may be exactly the person we're looking for!
Principal Site Reliability Engineer
Posted 27 days ago
Job Description
Orgvue is a leading organizational design and planning software platform that captures the power of data visualization and modelling to build more adaptable, and better performing organizations. HR, finance and business leaders use Orgvue for actionable insight and analysis that helps them make faster workforce decisions in a constantly changing world.
Orgvue is used by the world’s largest and best-known enterprises and management consulting firms to visualize and confidently build the businesses they want tomorrow, today. The company is headquartered in London, with offices in Philadelphia, The Hague, Toronto, and Sydney.
We are seeking a Principal Site Reliability Engineer who will be a senior technical leader focused on scaling and hardening our AWS- and Kubernetes-based infrastructure.
Role
In this role you will work across product, platform, and operations teams to ensure our systems are reliable, observable, and resilient, even at scale.
This role combines hands-on technical capability with strategic vision, helping us build a world-class reliability culture and a robust engineering foundation for growth. We're looking for someone who has technical expertise, is a great communicator and enjoys collaborating across multiple teams.
Responsibilities
- Define and enforce SLOs, SLIs, and error budgets across critical services
- Craft and implement a cloud infrastructure and tooling strategy
- Work across our org to level up SRE practices
- Help implement robust observability (metrics, logs, and traces) using our observability tool
- Guide the team in building automated, self-healing systems
- Own and evolve our incident response processes, including on-call practices and post-mortem culture
- Mentor engineers across the org on best practices in reliability, operational readiness, and scalable infrastructure
- Drive Infrastructure as Code (IaC) using Terraform, Kubernetes, CloudFormation and GitOps practices
- Collaborate closely with security, DevOps, and software teams to ensure compliance, scalability, and operational excellence
- Evaluate and introduce tools, patterns, and practices that improve the performance and reliability of our SaaS platform
Requirements
- Demonstrable experience leading SRE transformations
- Deep hands-on expertise with Kubernetes (EKS preferred) in production environments
- Strong experience with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.)
- Expert in Infrastructure as Code using tools such as Terraform, with knowledge of GitOps workflows
- Strong background in observability: metrics, visualization, logging, and tracing
- Understanding of automation, SDLC, CI/CD pipelines, deployment automation, and blue/green or canary releases
- Proven experience with incident management, disaster recovery planning, root cause analysis, and post-incident reviews
Benefits
- Hybrid working - 1+ days a week in the London office
- Wellbeing: Sanctus Coaching, Virtual fitness sessions, Wellbeing webinars, Annual Wellbeing day
- Subsidised Gym Membership
- Private Medical Insurance (including Dental and Vision) and Life Assurance
- 25 days holiday (increasing to 30 days at a rate of 1 extra day per year)
- Summer Fridays (half-day Fridays for the months of July and August)
- Employer pension contribution of 5% of your gross salary, if you contribute a minimum of 3%
- Season ticket Loan
- Cycle to Work Scheme
- Annual Discretionary Bonus
'Here at Orgvue we promote individualism and a diverse workforce to build on our future success'
Site Reliability Engineer - Remote
Posted 278 days ago
Job Description
At EFG (ESL FACEIT Group) we create worlds beyond gameplay where players and fans become community. We pride ourselves in having a corporate social responsibility which is that “IT’S NOT GG (Good Game), UNTIL IT’S GG FOR ALL”. We are passionate about the culture we foster that ultimately helps to create and shape the world of esports, gaming tournaments, leagues, events and holistic ecosystems staged for our millions of players, fans and heroes.
The Team:
As a Site Reliability Engineer at EFG, you will be designing, analyzing, and troubleshooting large-scale distributed systems. You will demonstrate a systematic problem-solving approach, and the ability to debug and optimize code and to automate routine tasks. You will ensure that EFG’s services and systems are reliable, that they have uptime appropriate to users' needs and they have a fast rate of improvement.
Apart from monitoring our systems' capacity and performance, you will also focus on optimizing existing systems, on building infrastructure and on eliminating work through automation. You will work collaboratively with the software engineering teams to deploy and operate our systems, and you will help to automate and streamline our operations and processes. Within this role, you will be given real responsibilities, and you will have the opportunity to drive change and have a big impact on our products and platform.
What you will do:
- Maintaining and improving the monitoring and observability tools (Grafana/Prometheus/Thanos/Jaeger);
- Working closely with your team and with other cross-functional teams to help design, maintain and operate systems at scale;
- Developing and driving adoption of SRE best practices across the company;
- Leading on incident management process and adoption;
- Using your troubleshooting skills to help identify and fix operational issues;
- Working with Cloud Native technologies such as Kubernetes, Envoy, Istio, Prometheus and Helm;
- Working with the “Hashi Stack” (Terraform, Packer, Vault);
- Experimenting with and introducing cutting edge technologies.
Requirements
- Proven experience as a Site Reliability Engineer, DevXP Engineer or Software Engineer, focusing on building and maintaining scalable infrastructures;
- Excellent working knowledge of at least one of the major cloud providers (GCP/AWS/Azure);
- You have experience with cluster management systems (Kubernetes);
- Knowledge of incident management: ability to investigate, troubleshoot, recover and prevent the recurrence of incidents that interfere with the normal delivery of IT services;
- Proficient in Go, with some level of proficiency in at least one other language: Java, Python, Rust…;
- You have knowledge of GitOps practices;
- You have production-scale experience with one of the following: MongoDB, Redis, MySQL;
- Experience contributing to open source technologies would be an added bonus.
Senior Site Reliability Engineer - Reigate
Posted today
Job Description
Ready to join a team that's leading the way in reshaping the future of insurance? Here at esure Group, we are on a mission to revolutionise insurance for good!
We’ve been providing Home and Motor Insurance since 2000, with over 2 million customers trusting us to keep them covered through our esure and Sheilas’ Wheels brands. With a bold dedication to digital innovation, we're transforming the way the industry operates and putting customers at the heart of everything we do.
Having completed our recent multi-year digital transformation, we’re now leveraging advanced technology and data-driven insights alongside exceptional service, to deliver personalised experiences that meet our customers’ ever-changing needs today and in the future.
We are currently recruiting for a Senior Site Reliability Engineer to join our Tech Enable team.
As a Lead Engineer for Site Reliability, you must demonstrate various skills to effectively lead and engage in SRE practices. The successful candidate will act as a point of escalation for critical issues, applying technical expertise to promptly address complex problems in collaboration with additional teams.
What you’ll do:
- Serve as the SRE Lead's backup, assuming leadership duties when necessary to maintain the continuity and efficiency of SRE operations.
- Provide day-to-day guidance, support, and informed decision-making for the team, maintaining stability and direction.
- Serve as a subject matter expert, shaping technical direction, leading initiatives, and mentoring colleagues to build team capability.
- Stay up to date with emerging technologies and industry trends, sharing knowledge across company communities to embed SRE best practice.
- Drive continual improvement by automating manual processes and optimising monitoring systems to achieve full estate coverage.
- Lead initiatives to improve availability, performance, and scalability through proactive monitoring, capacity planning, and ongoing maintenance.
- Collaborate with development squads to embed monitoring, reliability, and scalability best practices within the development lifecycle.
- Represent the SRE team in stakeholder engagements, providing progress updates, managing expectations, and addressing concerns.
- Operate as a primary contact for pressing issues, employing technical skills to solve complex problems rapidly in coordination with other teams.
- Participate in out-of-hours on-call or standby duties when required.
What we’d love you to bring:
- Deep experience of AWS (particularly EC2, EKS, Lambda, S3 and IAM)
- Monitoring / alerting tools (for example we use Grafana, Prometheus, Loki, CloudWatch and Dynatrace)
- SME on monitoring best practices for a variety of different platforms and technologies
- Docker and Kubernetes
- Git/GitLab
- Jenkins / CI/CD / Argo CD
- Jira and Confluence
- Scripting or coding with shell/Bash/Python
- Terraform
- Able to assist in troubleshooting complex issues involving multiple platforms and technologies
- Using agile principles and ways of working
- Familiarity with best practice for cloud hosted architectures and solutions
- The ability to manage and track multiple workstreams simultaneously
What’s in it for you?:
- Competitive salary that reflects your skills, experience and potential.
- Discretionary bonus scheme that recognises your hard work and contributions to esure’s success.
- 25 days annual leave, plus 8 flexible days and the ability to buy and sell further holiday.
- Our flexible benefits platform is loaded with perks to choose from, so you can build a personal toolkit to support your health, wellbeing, lifestyle, and finances.
- Company funded private medical insurance for qualifying colleagues.
- Fantastic discounts on our insurance products! 50% off for yourself and spouse/partner and 10% off for direct family members.
- We’ll elevate your career with hands-on training, mentoring, access to our exclusive academies, regular career conversations, and expert partner resources.
- Driving good in the world couldn’t be more important to us. Our colleagues can use 2 volunteering days per year to support their local communities.
- Join our internal networks and communities to connect, learn, and share ideas with like-minded colleagues.
- We’re a proud supporter of the ABI’s ‘Make Flexible Work’ campaign and welcome you to ask about the flexibility you need. Our hybrid working approach also puts you in the driving seat of how and where you do your best work.
- And much more; see a full overview of our benefits on the "Reward and benefits" page of the Esure Group PLC website.
We are committed to creating an inclusive and diverse workplace where everyone feels valued, respected, and empowered. We celebrate individuality and create spaces where unique backgrounds and experiences can come together. We believe that diverse perspectives drive innovation, in turn enabling us to better serve our customers and community and to build a stronger organisation. Our commitment to inclusion extends to every part of our business, from hiring practices to professional growth opportunities, ensuring equal access and support for all.