1,414 Devops Engineers jobs in the United Kingdom
Site Reliability Engineer
Posted today
Job Description
Senior Site Reliability Engineer
London - Hybrid
80,000 - 90,000 + 38 Days Holiday + Private Healthcare + Life Assurance + Flexible Working + Pension
Excellent opportunity for a Site Reliability Engineer to join a forward-thinking, high-growth technology company offering a hybrid work environment, great benefits, and opportunities for further progression!
This company operates at the forefront of digital transformation, delivering a unified platform built for scalability, resilience, and performance. With a strong culture rooted in integrity, creativity, and technical excellence, they've become a trusted partner across global industries.
In this role you'll take ownership of platform reliability, resilience engineering, and incident management across cutting-edge cloud infrastructure. You'll play a key role in ensuring uptime, performance, and continuous improvement of core systems.
The ideal candidate will be an experienced Site Reliability Engineer with a deep background in AWS, Kubernetes (EKS), Terraform, and monitoring/eventing tools. You'll have a strong grasp of application-level troubleshooting, chaos engineering, and performance tuning.
This is a fantastic opportunity to work in a modern DevOps environment where innovation is encouraged, personal development is supported, and technical impact is real.
The Role:
* Manage and optimise AWS and Kubernetes (EKS) infrastructure
* Implement resilience strategies and conduct chaos engineering experiments
* Monitor and maintain Kafka clusters for performance and reliability
* Respond to and resolve application-level production incidents
The Person:
* 5+ years in SRE, DevOps, or infrastructure engineering
* Strong experience with AWS, EKS/Kubernetes, and Terraform
* Familiar with Kafka and observability tools like Datadog or Grafana
* Able to troubleshoot issues across infrastructure and application layers
Reference number: BBBH(phone number removed)
To apply for this role or to be considered for further roles, please click "Apply Now" or contact Tommy Williams at Rise Technical Recruitment.
Rise Technical Recruitment Ltd acts as an employment agency for permanent roles and an employment business for temporary roles.
The salary advertised is the bracket available for this position. The actual salary paid will be dependent on your level of experience, qualifications and skill set. We are an equal opportunities employer and welcome applications from all suitable candidates.
Site Reliability Engineer

Posted 2 days ago
Job Description
Come join a team that is striving for operational awesomeness and trying to automate the world. We have a large presence with large cloud vendors. You should have experience with architecture, deployments, and networking in one or more of the major industry vendors. This is an incredible opportunity to use your existing cloud experience and drive the growth of Splunk Cloud.
**What we're looking for**
**NOTE:** **4 x 10h shifts: Wednesday - Saturday/8am-6pm**
We are looking for a TechOps SRE to help maintain, contribute to and improve the next generation of our large scale Cloud offering. You will be working with providers and supporting the infrastructure that powers Splunk's cloud offering.
**You should apply if**
+ **you are comfortable working 4 x 10h shifts: Wednesday - Saturday/8am-6pm**
+ You have operational experience at scale. You have had hands-on roles that deal with operating systems (particularly Linux) and networking. You might also have worked with Cloud technologies. Your previous job titles might be something close to systems admin, network engineer or devops engineer.
+ You're passionate about your work. Our customers are passionate about Splunk and we want the same from our engineers. You should enjoy actively being responsible for your work and be excited about your projects.
+ You love large complex systems. You have experience working on distributed systems or a passion for finding edge cases that appear at scale. You are interested in how to take a small one-off task and implement it across several thousand machines at once.
+ You have some development skills. We have code in several languages, ranging from Python and Shell to Go and C++. We don't expect you to be a software engineer but you should be familiar with basic programming and understand concepts like input sanitisation and unit testing.
+ "How can I automate this process?" is a question you constantly ask yourself.
+ Data drives your decisions. Data excites you and you make decisions based on numbers rather than assumptions. If an issue arises, you strive to be alerted before our customers notice.
+ You care about monitoring. Shipping code often and getting useful feedback excites you and you're not worried about changing direction when a solution isn't working as expected.
**What we provide**
+ Opportunities to develop and grow as an engineer. We are always expanding into new areas, working with open-source projects and contributing back, and exploring new technologies.
+ A team of incredibly capable and dedicated peers, all the way from engineering to product management and customer support.
+ Breadth and depth. You are interested in working in an area that dynamically scales to meet the needs of Splunk's cloud offering. You want to go deep into optimizing how we automate every manual process and tedious task we encounter.
+ Growth and mentorship. We believe in growing engineers through ownership and leadership opportunities. We also believe that mentors help both sides of the equation.
+ A stable, collaborative, and supportive work environment. Honesty and collaboration are values we see as a core part of our team identity. We understand the value of open communication: working together to get things done, and adapting to the changing needs of the team and individuals. This is reflected both in our internal communications and in how we interact with our customers.
+ Balance. We don't expect people to work 12 hour days. We want you to be successful outside of work too. We trust our colleagues to be responsible with their time and commitment, and believe that balance helps cultivate a positive environment.
Splunk, a Cisco company, is an Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis.
Site Reliability Engineer
Posted today
Job Description
Key Responsibilities:
- Design, build, and maintain scalable and reliable infrastructure using infrastructure-as-code principles.
- Develop and implement automation tools and scripts to streamline deployment, monitoring, and operational processes.
- Monitor system performance, identify bottlenecks, and implement solutions to improve efficiency and reliability.
- Respond to and resolve production incidents, performing root cause analysis and implementing preventative measures.
- Collaborate with development teams to ensure applications are designed for reliability and operability.
- Implement and manage CI/CD pipelines for efficient software delivery.
- Manage cloud infrastructure (AWS, Azure, GCP) and container orchestration platforms (Kubernetes, Docker).
- Develop and maintain comprehensive documentation for systems, processes, and procedures.
- Participate in on-call rotations to provide 24/7 support for critical systems.
- Continuously evaluate and improve system architecture, tools, and processes to enhance reliability and performance.
Qualifications:
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or Systems Engineer.
- Strong proficiency in at least one programming language (e.g., Python, Go, Java).
- Experience with cloud platforms (AWS, Azure, GCP) and containerization technologies (Docker, Kubernetes).
- Solid understanding of networking concepts, operating systems (Linux), and system administration.
- Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI).
- Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Experience with infrastructure-as-code tools (e.g., Terraform, Ansible).
- Excellent problem-solving, analytical, and communication skills.
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent practical experience).
This position is based at our office in Aberdeen, Scotland, UK, and requires your physical presence.
Site Reliability Engineer
Posted 5 days ago
Job Description
Gizmo is an AI startup on a mission to make learning so easy that anyone can learn anything. We're building Duolingo for anything - a platform that uses gamification and social mechanics to make learning fun.
With over 1 million monthly active users and $4M in annual recurring revenue, we’re already one of the fastest-growing startups in the UK. Backed by leading investors, we recently raised $22M in Series A funding to accelerate our vision of helping 1 billion people learn.
Role Overview
Reporting to the CTO, you will own capacity, performance and reliability for Gizmo’s full-stack platform as daily traffic climbs from hundreds of thousands to millions of users. You’ll write code across the stack, but your charter is classic SRE: defend SLOs, eliminate toil, and raise the ceiling on scale before it becomes a hard limit.
Key Responsibilities
- Define SLIs/SLOs for latency, availability and error rate; codify error budgets and partner with product teams on trade-offs.
- Perform load-testing, capacity modelling and up-front scalability design for PostgreSQL, OpenSearch, Redis, Hasura and CF Workers; produce data-driven scaling plans.
- Extend metrics, structured logging and tracing; establish alert rules that page only on user-visible impact; build actionable runbooks.
- Join the on-call rotation, lead blameless post-mortems, drive remediation work to closure and track MTTR/MTBF improvements.
- Automate repetitive ops on Kubernetes and CI/CD; keep “toil” below 50% of your time by pushing fixes into code.
- Coach full-stack engineers on query optimisation, schema design and back-pressure techniques; document patterns and anti-patterns in an SRE playbook.
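For applicants less familiar with the error-budget idea mentioned in the responsibilities above, here is a minimal sketch of how a budget can be derived from an availability SLO. The function name, the 99.9% target and the request counts are illustrative assumptions only and do not describe Gizmo's actual tooling or targets.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise an error budget for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical value, not from the posting).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the share of requests that may fail without breaching the SLO.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining": allowed_failures - failed_requests,
        "consumed_fraction": consumed,
    }


if __name__ == "__main__":
    # Illustrative numbers: 1M requests in the window, 250 of them failed.
    print(error_budget(0.999, 1_000_000, 250))
```

A team might freeze risky releases once `consumed_fraction` approaches 1; that policy is a common convention rather than anything stated in this listing.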
Requirements
- Hands-on scale experience: you have run relational stores at 100k+ TPS or 1M+ concurrent users (e.g., multi-tenant PostgreSQL, sharded MySQL).
- Strong backend fundamentals around concurrency, caching, indexing and distributed systems trade-offs.
- Proven track record of setting SLOs, building dashboards (Prometheus/Grafana, OpenTelemetry, etc.) and tuning alerts.
- Comfort with Kubernetes, IaC and cloud-native patterns; can debug from network to application layer.
- Start-up bias for action: you prioritise high-leverage fixes, ship iteratively and own outcomes end-to-end.
- Collaborative and feedback-driven; you welcome post-mortem culture and continuous improvement.
- Driven by impact - you prioritise work that moves the needle!
Nice-to-haves: experience with Hasura internals, Cloudflare Workers edge optimisation, or operating OpenSearch at scale.
Benefits
- Highly competitive salary.
- You'll own a piece of what you're building - equity included.
- Hybrid working model with 4 days in our East London office, ideally located between Shoreditch High Street, Old Street, and Liverpool Street stations.
- The opportunity to become one of the earliest employees in one of the UK’s fastest-growing startups.
- Private health insurance
Site Reliability Engineer
Posted 563 days ago
Job Description
Opportunity
This is an opportunity for a software professional to take a key role in ensuring seamless and reliable access to our Quantum Computers for a global audience. This role would be an ideal opportunity for someone with a strong Linux system administration or DevOps background to make the move into their first Site Reliability Engineer role.
The successful candidate will join a dynamic and supportive team and be empowered to design & develop new tools and automation processes that streamline operations and enhance system reliability in one of the most cutting edge areas of the technology sector.
This is not a 9-5 role and will require the post holder to be part of an on-call rota to ensure we provide 24hr support.
Remuneration + Benefits
- £60-70k per annum
- Private medical insurance
- Group life and group income protection
- Gym and wellness benefits
- EAP cash plan
- Cycle to work scheme
- 25 days holiday
- Pension
- Employee Stock Ownership Plan (ESOP)
- Hybrid working
The Role
As a Site Reliability Engineer, you will be focusing on building and operating reliable systems and services that ensure high availability, resilience, and scalability for our global network for Quantum Computers. You will also be responsible for monitoring, diagnosing and controlling distributed production environments to avoid manual intervention, while ensuring compliance with high security standards and regulations.
In this role you will solve complex problems related to infrastructure, cloud services and quantum compilation processes, and build automation to avoid manual intervention and prevent problems from recurring. You will actively improve our operational capabilities, hot-fix issues autonomously wherever possible, and be part of a support system that enables highly responsive round-the-clock support. You will build an outstanding code-level knowledge of all our production products and be able to work in and out of these teams, steering these products towards observability and reliable operations.
This role would be suited either to someone with SRE experience or an experienced Linux/ DevOps / Cloud engineer interested in moving into an SRE role.
- Ensuring high security standards
- Maintain scalability and availability of our QCaaS system
- Improving technical readiness levels
- Be active on development teams
- Building and operating operation systems
- Monitoring and diagnosing issues
- Testing software in development (including destructive testing)
- Performing roll-backs of software when issues arise.
- Solve complex problems related to infrastructure, cloud services & quantum compilation processes
Skills + experience
Required Skills and experience
- DevOps Engineering experience
- Software Developer/ Engineering experience
- Maintenance of cloud infrastructure
- Relevant commercial experience
- Experience in Docker/Kubernetes and Azure/AWS
- Strong knowledge of GitHub CI/CD process for Python applications
- Strong knowledge of databases such as Postgres and PL/SQL
- Proficient with commonly used networking protocols such as TCP/IP, HTTP
- Technical communicator
- Strong troubleshooting and performance tuning skills.
- Autonomous worker, with the confidence to take on tasks when the rest of the team is unavailable e.g. outside office hours
- Experience in Automation of testing / monitoring tools
- Team player
Desired Skills
- Experience managing on-premises environments would be highly beneficial
- Experience developing new tools from the ground up
- Experience of on-call positions
- Degree in related field (Computing, numerical, scientific)
Research has shown that women are less likely than men to apply for this role if they do not have solid experience in 100% of these areas. Please know that this list is indicative and that we would still love to hear from you even if you feel you only are a 75% match. Skills can be learnt, diversity cannot.
Our Company
At OQC, we see a brighter future for all, enabled by quantum.
Together we are pioneering cutting-edge quantum computers that unlock transformative discoveries, from advancing drug modelling to revolutionising battery technology. Our mission is to put quantum in the hands of humanity, empowering customers to discover new commercial and scientific frontiers.
When you join OQC, you become part of a diverse team of innovators, creators, and problem solvers. We bring together some of the brightest minds in quantum physics, nanotechnologies, hardware, software and commercial operations. Each team member brings a unique skill set, and all are united by our values, which guide us in everything we do - how we work, how we collaborate and how we shape the future of our industry.
Are you ready to help us build this future?
APPLY NOW!
Please use the link provided to apply for the role of Site Reliability Engineer. To aid your application, it will be beneficial to provide us with a cover letter outlining why you think you would be a good fit for the role and what attracts you to OQC. We look forward to hearing from you!
At OQC we are not just hoping you’ll fit in our culture. We aspire to thrive, as a company and as people, thanks to your diversity of thought and background. We are proud to be an equal opportunity employer and we are committed to providing our team members with a work environment free from discrimination, where everyone is treated with respect. Our employment decisions are based on business needs, talent and merit and all our colleagues share in the responsibility for fulfilling our commitment to diversity. We look forward to meeting you!
Remote Site Reliability Engineer - Cloud Infrastructure
Posted 4 days ago
Job Description
You will be responsible for designing, building, and maintaining the infrastructure that powers our applications. This includes implementing infrastructure as code (IaC) using tools like Terraform or CloudFormation, managing CI/CD pipelines, and developing sophisticated monitoring and alerting systems. Proactive identification and resolution of potential issues, capacity planning, and performance tuning are key aspects of the role. You will participate in on-call rotations to respond to and resolve production incidents, conducting post-mortems to prevent recurrence. Collaboration with development teams to ensure reliability is designed into new features from the outset is crucial.
The ideal candidate will have a strong background in systems administration, software development, or a related field, with demonstrable experience in SRE principles. Proficiency in cloud platforms (AWS, Azure, or GCP), containerization technologies (Docker, Kubernetes), and scripting languages (Python, Bash, Go) is essential. Experience with monitoring tools (Prometheus, Grafana, Datadog) and CI/CD tools (Jenkins, GitLab CI) is highly valued. A Bachelor's degree in Computer Science or a related field, or equivalent practical experience, is required. Excellent problem-solving skills, a strong understanding of distributed systems, and the ability to work independently in a remote setting are critical for success. Join us to build and maintain resilient, highly available systems from anywhere.
Responsibilities:
- Design, implement, and manage scalable and reliable cloud infrastructure.
- Develop and maintain Infrastructure as Code (IaC) using tools like Terraform.
- Build and optimize CI/CD pipelines for efficient deployments.
- Implement comprehensive monitoring, logging, and alerting solutions.
- Automate operational tasks and processes to improve efficiency.
- Respond to and resolve production incidents, leading post-mortems.
- Collaborate with development teams to ensure system reliability and performance.
- Conduct capacity planning and performance tuning.
- Ensure security best practices are implemented across the infrastructure.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
- 3+ years of experience in Site Reliability Engineering, DevOps, or a similar role.
- Proficiency with cloud platforms (AWS, Azure, GCP).
- Strong experience with containerization technologies (Docker, Kubernetes).
- Expertise in scripting languages such as Python, Bash, or Go.
- Experience with IaC tools (Terraform, Ansible, CloudFormation).
- Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).
- Understanding of networking concepts and protocols.
- Excellent troubleshooting and problem-solving skills.
- Ability to work effectively in a remote, collaborative team environment.
Remote Site Reliability Engineer - Cloud Infrastructure
Posted 5 days ago
Job Description
As a remote-first employee, you will collaborate closely with development and operations teams to implement best practices for site reliability. Your responsibilities will include developing and deploying automation tools, managing infrastructure as code (IaC) using platforms like Terraform or Ansible, and optimizing CI/CD pipelines. You will actively participate in on-call rotations to address production incidents and perform root cause analysis. The ideal candidate will possess a deep understanding of containerization technologies (Docker, Kubernetes), distributed systems, and cloud platforms (AWS, Azure, GCP). Strong scripting skills (Python, Go, Bash) are essential, as is a proactive approach to identifying and mitigating potential system failures. This is an exceptional opportunity to contribute to a high-growth tech environment with a fully remote setup.
Key Responsibilities:
- Design, build, and maintain scalable and reliable cloud infrastructure.
- Develop and implement automation for deployment, monitoring, and incident management.
- Manage and optimize container orchestration platforms (e.g., Kubernetes).
- Utilize Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible) for provisioning and configuration management.
- Monitor system performance, identify bottlenecks, and implement performance tuning strategies.
- Respond to and resolve production incidents, conducting thorough root cause analysis.
- Collaborate with development teams to improve application reliability and deployability.
- Implement and manage CI/CD pipelines to ensure efficient software delivery.
- Develop and maintain robust monitoring, logging, and alerting systems.
- Contribute to capacity planning and disaster recovery strategies.
- Participate in on-call rotations to provide 24/7 support for critical systems.
- Document infrastructure, processes, and runbooks.
Required Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar infrastructure role.
- Strong proficiency in cloud platforms (AWS, Azure, or GCP).
- Expertise with containerization technologies like Docker and Kubernetes.
- Hands-on experience with IaC tools such as Terraform, Ansible, or CloudFormation.
- Proficiency in one or more scripting languages (e.g., Python, Go, Bash).
- Solid understanding of networking concepts (TCP/IP, DNS, HTTP).
- Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI).
- Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration abilities, essential for remote teamwork.