560 Site Reliability Engineer jobs in the United Kingdom
Site Reliability Engineer
Posted today
Job Description
HCLTech is a global technology company, home to more than 220,000 people across 60 countries, delivering industry-leading capabilities centered around digital, engineering, cloud and AI, powered by a broad portfolio of technology services and products. We work with clients across all major verticals, providing industry solutions for Financial Services, Manufacturing, Life Sciences and Healthcare, Technology and Services, Telecom and Media, Retail and CPG, and Public Services. Consolidated revenues as of 12 months ending December 2024 totaled $13.8 billion.
SRE for production support of a mission-critical tokenization platform. The candidate should be strong in ITSM processes and hands-on with automation scripting and cloud technologies.
Good to have proficiency with:
- Programming - Java, Vert.x, Python, Shell Scripting, Go, REST
- SRE - Kubernetes, Splunk/ELK, OpenShift, CI/CD
- DB - Postgres/Couchbase/Oracle
Technical Skill
Managing production support for mission-critical platforms
Implementing and following ITSM processes for incident handling
Writing automation scripts using Shell, Python, or Go
Deploying and managing Kubernetes clusters in production
Operating and troubleshooting OpenShift environments
Building and maintaining CI/CD pipelines for cloud-native apps
Monitoring and alerting using Splunk or ELK
Querying and tuning Postgres or Oracle databases
Developing and debugging REST APIs for platform integration
Supporting Java and Vert.x based microservices in production
Managing Couchbase clusters and optimizing performance
Monitoring and resolving issues in Postgres/Oracle databases
Site Reliability Engineer
Posted today
Job Description
The Role
Speechmatics are seeking a Site Reliability Engineer (SRE) whose focus will be improving the reliability of our products, systems and infrastructure. You will work across teams to improve availability, scalability, performance and efficiency of our real-time AI inference APIs.
You will get to work with high-scale GPU deployments spread across the world. Our customers expect low-latency responses, making this a really interesting problem space to learn about.
What you'll be doing:
- Working with a diverse group of engineers across Speechmatics to improve reliability of our products and systems, from design through to operation in production.
- Taking part in incident response and postmortems, and ensuring the same incident doesn't happen twice.
- Managing and improving GitOps release workflows and CI/CD pipelines.
- Monitoring system performance and troubleshooting production environments.
- Implementing observability improvements using OpenTelemetry tooling.
- Automating processes to reduce manual effort and create self-healing systems.
- Taking part in the on-call rota for production systems, which comes with a generous daily pay rate and a mentorship programme to build your confidence in doing it.
Who we are looking for:
- Comfortable navigating and troubleshooting Linux systems directly from the command line.
- Hands-on experience with major cloud platforms such as AWS, Azure, and GCP.
- Skilled in managing containerised applications using tools like Docker, Kubernetes, and Helm.
- Proficient in Infrastructure-as-Code practices — our production stack includes Terraform, Datadog, ArgoCD, and GitLab.
- Strong focus on automation; you streamline workflows with Bash scripts and turn to Python when things get more complex.
- Curious about the entire technology landscape — from bare-metal servers to cloud abstractions — and motivated to understand how each layer fits together.
- Naturally inquisitive and eager to dive deep into new technologies; you thrive on learning as you go.
- Prior experience with on-call rotations and incident response is a plus.
- Familiarity with OpenTelemetry and related observability tooling is advantageous.
We encourage you to apply even if you do not feel you match all of the requirements exactly. The list of requirements is intended to show the kinds of experience and qualities we're looking for, but it is not exhaustive. If you are interested in the role, the team, and our mission, we would love to consider your application. We are always open to conversations and look forward to hearing from you.
Who we are:
Speechmatics is the leading expert in Speech Intelligence, and uses AI and Machine Learning to unlock business value in human speech worldwide. We work with an amazing mix of global companies, and our technology can integrate into our customers' stack irrespective of their industry or use case – making it the go-to solution to harness useful information from speech.
Joining us means working with some of the smartest minds around the world, focused on cutting-edge projects and deploying the latest techniques to disrupt the market. We believe in putting people first; we'll do all we can to help you develop your skills and give you the tools you need to thrive. Our Focus Fridays give you an undisturbed day of focus, offset with Together Tuesdays when we have our team meetings, so you've always got the right balance.
We have structured a hybrid approach that includes 2-3 designated office days each week. This arrangement ensures that while we embrace the advantages of remote work, we also maintain the vital connection and synergy that only in-person interactions can foster.
This is only the beginning; we're looking for amazing people like you to continue our journey…
What we can offer you:
No matter what stage of your career you're at - from paid internships and first-job opportunities through to management and senior positions - we'll support you with the training and development needed to reach your career aspirations with us. There really is no shortage of opportunities here for you to get involved and collaborate with those around you to deliver your best work.
We offer incredibly flexible working, regular company lunches, and birthday celebrations. But that's not all. We've spoken to our teams to find out what they want. From Private Medical, and Dental for you and your family, through to global working opportunities, a generous holiday allowance and pension/401K matching, we want to make sure our employees and their families are looked after. Every employee will receive a working from home allowance for tech or home office equipment (on top of your choice of laptop and accessories of course). Our approach to parental leave is designed to support employees globally. While this varies by geo, we have support in place for parents (including adoption assistance and reproductive health services) to ensure they have the time and financial resources needed to care for their growing families.
At Speechmatics, our mission is simple: Understand Every Voice out there.
That's not just about our tech – it's the heart and soul of who we are. We welcome different experiences, viewpoints, and identities. For us, it's not just the right thing to do; it's our catalyst for sparking innovation and creativity. Our teams thrive in an environment that celebrates and supports everyone – no matter their gender, identity or expression, race, disability, age, sexual orientation, religion, belief, marital status, national origin, veteran status, pregnancy, or maternity status.
But we don't just open the door to diversity – we actively welcome it. Why? Because we believe every unique voice adds something special to our team, leading us to smarter solutions and a better workplace.
So, come as you are and join our Speechling community. We're building a place where every voice not only gets heard but is also respected and valued.
For more information on us, please visit our website and follow Speechmatics on our social channels via Twitter, Facebook, LinkedIn, and YouTube.
We rely on legitimate interest as a legal basis for processing personal information under the GDPR for purposes of recruitment and applications for employment.
Site Reliability Engineer
Posted today
Job Description
C3 AI (NYSE: AI) is the Enterprise AI application software company. C3 AI delivers a family of fully integrated products including the C3 Agentic AI Platform, an end-to-end platform for developing, deploying, and operating enterprise AI applications; C3 AI applications, a portfolio of industry-specific SaaS enterprise AI applications that enable the digital transformation of organizations globally; and C3 Generative AI, a suite of domain-specific generative AI offerings for the enterprise. Learn more at:
C3 AI
We are looking for a Site Reliability Engineer to join our team in London.
Responsibilities
- Maximize system uptime and availability, ensuring functional and performance SLAs.
- Establish end-to-end monitoring and alerting on all critical aspects.
- Solve complex problems for critical services and build automation to prevent problem recurrence.
- Influence and create new designs, architectures, standards, and methods for supporting the platform.
- Initiate and lead scripting and automation to streamline system updates and upgrades.
- Set up critical infrastructure, tools, and framework to streamline the deployment cycle.
- Work cross-functionally with Services and Engineering teams.
Qualifications
- Demonstrated experience in deploying, managing, and operating scalable and fault-tolerant Linux/Kubernetes/JVM-based infrastructure in AWS, GCP, and other public clouds.
- Expertise in Linux Operating Systems, Networking, and Database concepts.
- Experience with Cassandra (or another NoSQL alternative).
- Expertise in cloud providers, such as Amazon Web Services, Azure, and GCP.
- Experience with configuration management systems such as Ansible or Puppet.
- Experience using Ruby or Python to automate and monitor systems.
- Excellent problem-solving, critical thinking, and communication skills.
- Experience in a DevOps or sysadmin support role for commercial SaaS solutions.
- BS or MS in Computer Science, related field, or equivalent professional experience.
C3 AI provides excellent benefits and a competitive compensation package.
C3 AI is proud to be an Equal Opportunity and Affirmative Action Employer. We do not discriminate on the basis of any legally protected characteristics, including disabled and veteran status.
Site Reliability Engineer
Posted today
Job Description
BAE Systems Digital Intelligence is home to 4,500 digital, cyber and intelligence experts. We work collaboratively across 10 countries to collect, connect and understand complex data, so that governments, nation states, armed forces and commercial businesses can unlock digital advantage in the most demanding environments.
Site Reliability Engineering is a rapidly growing concept in industry, with a remit to drive the quality, reliability and performance of essential systems. As a Site Reliability Engineer you'll be part of a team in BAE Systems at the forefront of this, delivering these benefits to a key national security customer. We are in the process of building our team and tools, and with your help will create a culture of continual improvement to revolutionise the way our customer's systems are built and maintained. This role blends operational product support with software engineering to create applications to understand the overall health of our systems. The SRE team sits within a wider programme at the core of the customer mission.
The Role Holder
As an SRE, fundamentally you will be doing work that has historically been done by an operations team, but using software and systems engineering expertise to substitute automation for human labour, with the objective of limiting traditional manual operations work (incident tickets, on-call etc.) to no more than half of the SRE team's time (and aiming for considerably less). You will have an enthusiasm to learn and experiment, to develop tools to understand application health and improve their reliability to support the customer mission.
Role Accountabilities Include
Supporting and maintaining essential services that support core mission applications, proactively enhancing their availability, performance and stability.
Being part of the 24/7 on call rota, supporting critical production systems out of business hours, for which additional on call allowances and overtime benefits will be paid.
Finding innovative solutions to problems rather than undertaking repetitive work, automating everything you can. You will work alongside development teams, advising them of good practice in how to design and build systems, learning from what you know works well.
You will design and deploy monitoring products, creating bespoke tools where required, to provide comprehensive and intelligent observations to meet the customer requirements and demonstrate the improvements the team are making on a daily basis. You will be well versed in the relationship between software and infrastructure, understanding the characteristics of systems that enable them to be scalable and resilient to failure, and how to get the best out of the infrastructure they are deployed to.
Participating in the wider DevOps/SRE community within the organisation.
Competencies
It is desirable for you to have experience in the areas below. However, more valued for this role is that you have excitement and enthusiasm to learn new technologies, and to deal with hard problems. Training, knowledge sharing and on the job development will enable you to plug any knowledge gaps.
- Software development in web technologies and object oriented programming
- Database technologies such as Oracle SQL, Mongo, Postgres
- Know your way around Linux and Windows command lines, e.g. Bash and PowerShell
- Monitoring large systems using technologies such as Grafana, Prometheus, ELK, Splunk
- Experience of working in Agile teams, and the tooling that supports it, e.g. Atlassian
- Diagnosing and troubleshooting application issues resulting in service outages
- Troubleshooting skills across different levels of the stack
- Understanding of ITIL
- Micro-services architectures, Docker and container platforms such as Openshift, Kubernetes
- Awareness and insight into technology trends to adopt new cutting-edge tools
Security Clearance
Due to the nature of our work, successful candidates for this role will be required to hold an active eDV before applying for this opportunity.
Life at BAE Systems Digital Intelligence
We are embracing Hybrid Working. This means you and your colleagues may be working in different locations, such as from home, another BAE Systems office or client site, some or all of the time, and work might be going on at different times of the day.
By embracing technology, we can interact, collaborate and create together, even when we're working remotely from one another. Hybrid Working allows for increased flexibility in when and where we work, helping us to balance our work and personal life more effectively, and enhance well-being.
Diversity and inclusion are integral to the success of BAE Systems Digital Intelligence. We are proud to have an organisational culture where employees with varying perspectives, skills, life experiences and backgrounds – the best and brightest minds – can work together to achieve excellence and realise individual and organisational potential.
Division overview: Capabilities
At BAE Systems Digital Intelligence, we pride ourselves in being a leader in the cyber defence industry, and Capabilities is the engine that keeps the business moving forward. It is the largest area of Digital Intelligence, containing our Engineering, Consulting and Project Management teams that design and implement the defence solutions and digital transformation projects that make us a globally recognised brand in both the public and private sector.
As a member of the Capabilities team, you will be creating and managing the solutions that earn us our place in an ever changing digital world. We all have a role to play in defending our clients, and this is yours.
Site Reliability Engineer
Posted today
Job Description
Company Introduction
Mercor connects elite creative and technical talent with leading AI research labs. Headquartered in San Francisco, our investors include Benchmark, General Catalyst, Peter Thiel, Adam D'Angelo, Larry Summers, and Jack Dorsey.
Role Overview
- Position: Site Reliability Engineer (SRE) – Full-Time, San Francisco
- Commitment: 40 hours per week
- As an SRE at Mercor, you'll build and automate systems to keep our platform reliable, scalable, and fast. You will work across every layer of the stack to drive measurable reliability improvements.
Responsibilities
- Mentor engineers on best practices for observability, alert management, and instrumentation.
- Lead incident response from triage through post-mortem and remediation.
- Own and improve load-testing, disaster-recovery, and chaos-engineering programs.
- Automate reliability checks, capacity planning, and service-level monitoring.
- Partner with product and platform teams to design for reliability and scalability from the start.
Requirements / Qualifications
Must-Have Qualifications
- Background in SRE
- Proficiency in Terraform, Python, Go
- Experience working with AWS
Preferred Qualifications
- Experience with RDBMS (MySQL)
- Experience with document storage systems (MongoDB)
- Experience with caching systems (Redis)
- Exposure to data warehousing (Snowflake)
- Previous work in a high-growth startup environment
Engagement Details
- Full-Time position
- Location: San Francisco
- Remote work flexibility
- Competitive compensation
Application Process (Takes 20-30 mins to complete)
- Upload resume
- AI interview based on your resume
- Submit form
Resources & Support
- For details about the interview process and platform information, please check:
- For any help or support, reach out to:
PS: Our team reviews applications daily. Please complete your AI interview and application steps to be considered for this opportunity.
Site Reliability Engineer
Posted today
Job Description
Site Reliability Engineer (Lead Level) | London | Up to £600 Inside IR35 | Hybrid (2 Days Onsite) | 6 months
I’m partnered with a major media and tech company looking for a Lead Site Reliability Engineer to support and scale their Video on Demand (VOD) infrastructure. You’ll work across modern tech stacks including AWS, GCP, Cassandra, and Kafka, helping deliver reliable, high-performance systems used by millions.
What you’ll do
- Lead project delivery while supporting day-to-day operations and incident management
- Build and manage infrastructure as code to improve reliability, scalability, and performance
- Design and implement new architectures and best practices for infrastructure and delivery
- Drive automation across monitoring, CI/CD, and deployment pipelines
- Mentor engineers and guide technical decisions within a fast-paced, cross-functional environment
What you’ll bring
- Strong Linux administration skills (Ubuntu preferred)
- Hands-on experience with AWS and GCP
- Proficiency in Terraform, Ansible, Jenkins, or GitLab CI
- Knowledge of Kafka, Cassandra, and relational or NoSQL databases
- Scripting skills in Python, Bash, Go, or Java
- Familiarity with monitoring tools like Prometheus, Nagios, or Icinga
- Understanding of networking fundamentals and virtualisation (e.g. VMware)
- Comfortable with on-call rotations and troubleshooting in live environments
Up to £600 per day (Inside IR35)
London | Hybrid (2 days onsite)
6-month contract, with strong potential to extend
If you’re an experienced SRE who enjoys taking ownership, leading technical delivery, and working on large-scale content platforms, I’d love to chat.
Apply or message me if you’d like to hear more.
Site Reliability Engineer
Posted 2 days ago
Job Description
Create the future of travel with us
Whether it’s to visit the people closest to us, starting an exciting adventure, or a career-defining business trip, travel is an essential part of our lives. Yet we've all experienced the aches and pains of getting to our destination. Today, more than 4 billion airline passengers rely on technology that hasn't kept up with the expectations of the modern connected traveller.
That’s why we’ve started to rebuild the infrastructure that underpins the travel industry. We’re on a mission to unravel travel — simplifying systems and building the tools that will make the future of travel effortless.
We were part of Y Combinator's S18 cohort and are backed by Benchmark, Blossom, Index Ventures and Kima Ventures, a fantastic set of investors that has helped build some of the world's largest companies.
Our team in London is growing and we’re looking for talented people to join us on our journey.
Site Reliability Engineer
Posted 2 days ago
Job Description
Company Description
WALT Labs, a leading managed service provider, is dedicated to empowering businesses by harnessing the power of cloud technology. Our team specializes in delivering customized solutions tailored to meet the unique needs of our clients, driving growth and operational efficiency across industries. From supporting small businesses with seamless data migration to enabling large corporations to manage complex infrastructure projects, we provide exceptional service while staying at the forefront of cloud technology advancements.
Role Description
This is a full-time role, on-site a minimum of 3 days a week in King's Cross, London. We are seeking a skilled Site Reliability Engineer with a strong focus on Google Cloud Platform (GCP) to join our dynamic team. In this role, you’ll be responsible for maintaining cloud infrastructure, managing incidents, and ensuring seamless operations for our clients. You’ll use tools like incident.io and JIRA to manage and resolve support requests efficiently.
Qualifications
- 8-10 years of experience managing applications and infrastructure performance.
- Proven experience with Google Cloud Platform (GCP) services.
- Familiarity with incident.io for incident tracking and management (or equivalent).
- Proficiency in using JIRA for task management and support workflows.
- Strong experience working with observability tools (Grafana).
- Strong troubleshooting and problem-solving skills in cloud environments.
- Understanding of cloud security and performance optimisation best practices.
- Knowledge of scripting or automation tools (e.g., Python, Terraform) is a plus.
- Excellent communication and customer service skills.
- Certifications in GCP (Professional certifications) are highly desirable.
- Ability to work under pressure and prioritise tasks effectively.
- Bachelor’s degree in Computer Science, Information Technology, or related field (or equivalent experience).
Responsibilities
- Provide technical support and resolve issues related to Google Cloud Platform (GCP) services and AWS.
- Manage and respond to cloud incidents using incident.io, ensuring timely resolution.
- Use JIRA to log, track, and prioritize support tickets and workflow tasks.
- Monitor and maintain cloud infrastructure for performance, reliability, and security.
- Collaborate with teams to identify and implement solutions to technical challenges.
- Assist in deploying, configuring, and optimising GCP resources.
- Create and maintain documentation for troubleshooting processes and best practices.
- Proactively identify opportunities to improve cloud environments and support processes.
- Support clients and stakeholders by providing clear communication and updates during incident resolution.
- Stay up-to-date with the latest GCP developments and contribute to team knowledge sharing.
Benefits
- 20 holiday days + bank holidays (earn 1.5 days every 3 years)
- Private health insurance
Site Reliability Engineer
Posted 2 days ago
Job Description
My client is a top-tier global quantitative and systematic investment manager, applying a technology- and data-driven approach to solve some of the most complex challenges in financial markets. By combining expertise in research, engineering, and trading, they foster a culture of innovation that drives consistent, high-quality returns for investors.
They are seeking a pragmatic and commercially focused Site Reliability Engineer (SRE) to join their team. In this role, you will be pivotal in ensuring the reliability, scalability, and performance of the systems that underpin world-class trading strategies.
The Role
- Design and maintain scalable, reliable systems supporting front-line trading activity.
- Monitor and troubleshoot performance issues, ensuring high system availability.
- Partner with developers to enhance reliability, performance, and scalability.
- Build automation tools to streamline operations and improve efficiency.
- Ensure systems meet stringent security and compliance standards.
- Provide technical support and guidance across teams.
Requirements
- Proven experience in Site Reliability Engineering or a similar role.
- Strong knowledge of system architecture and cloud platforms.
- Proficiency in at least one programming language (Python, C++, or C#).
- Familiarity with monitoring, observability, and logging tools.
- Excellent problem-solving skills with strong attention to detail.
- Commercial mindset with a focus on business impact.
- Finance industry experience is a plus, but not essential.