Site Reliability Engineer

The Zonal group is one of the UK’s largest technology providers to the hospitality industry.

Our products are used by over 11,000 pubs, restaurants, and hotels. Customers include national brands like Pizza Express, JD Wetherspoons and All Bar One.

We provide our customers with the solutions they need to make their business a success.

These solutions include mobile apps for ordering and web apps for engaging with consumers either through loyalty or reservations. By linking these solutions to Zonal’s EPoS (till) system, we help hospitality brands to understand their customers’ behaviour and preferences, enabling them to excel in an increasingly competitive market.

If you have booked a table or hotel room, ordered and paid for food and drinks, received loyalty offers, or downloaded your favourite hangout’s app, you will likely have used a Zonal product.

We are a family business with Scottish roots. We operate from our modern head office in Edinburgh, our Marketing Technologies Division in Staffordshire, our Innovation Centre in Abingdon, and our hotel management solutions base in Cardiff.

What you’ll do

This role sits within the Zonal Managed Services team and is part of the wider Zonal Technical Services business unit.

Our suite of SaaS, distributed systems and product integrations helps our customers run their critical business operations and, in turn, provide their own customers with industry-leading hospitality technology products.

You’ll play a key role in the formation of a new area within Zonal: our Production Operations (ProdOps) team that aims to drive operational excellence and customer focus into the operation of our SaaS hosted application suite.

As a Site Reliability Engineer, you’ll combine your software and systems engineering expertise in platforms, system design, integration, deployment and ongoing maintenance to help build and run our industry-leading SaaS solution.

You will help build a new key Site Reliability Engineering (SRE) function within Zonal, implementing best-practice monitoring, processes and tooling focussed on uptime, performance, and reduction of toil.

You will work closely with engineering teams in Development and Platform Delivery (Networks, DBA, and Infrastructure) to uphold contracted Service Level Objectives (SLOs).

You will be tasked with ensuring that our internal and externally available systems have reliability and uptime appropriate to users’ needs.

The makeup of our systems is changing rapidly, and you’ll play a key part in helping us drive this forwards. We’re moving towards a modern DevOps landscape with technologies like Docker, IaC and microservices. Initially we are working with our own hosted data centre infrastructure; however, you’ll play a key role in our drive towards a future hybrid public-cloud position.

You will contribute towards driving organisational change within Zonal, focussing on the core tenets of Site Reliability Engineering, namely the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of our SaaS services.

You and your team will:

  • Build strong, collaborative relationships, acting as the glue between in-house customer-facing support and delivery teams, product management, and platform engineering (R&D) teams
  • Create frameworks to continually improve:
    • Observability through logging, monitoring, and alerting
    • Capacity analytics and demand management
    • Dashboards, internal and external status pages
    • CI / CD pipelines, release processes
    • Automation of manual processes, tooling and IaC including security checks and break-glass procedures
    • Ownership of some cross-cutting implementation like logs / metrics infrastructure
    • Team processes, driving technical debt down
    • Triage, response, and recovery times
    • Disaster Recovery models and planning
  • Reduce toil (work that is largely manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as our services grow), maximising engineering capacity
  • Bring expertise and a streetwise perspective to problem solving, reduction of complexity and reliability patterns
  • Participate in On-Call cover and Incident Response
  • Play a key role in change management and the delivery pipeline into production, ensuring safety, predictability, repeatability, and auditability of all build and deploy processes
  • Proactively manage delivery of key SLOs covering Detection / early warning and self-healing
  • Own all monitoring and alerting applications, services and infrastructure
  • Act as key stakeholders in the technical debt reduction of our Products by contributing towards the Tech Debt backlogs for our R&D teams

Who you are

You will have a background in deploying, managing, and operating mission-critical distributed SaaS systems, having spent at least some of your career as a member of a fast-paced product engineering, web operations and/or platform delivery team. Ideally you will have a demonstrable track record of operating systems in hybrid datacentre/cloud infrastructures.

When things go wrong – and they will – we want to know about the problem before our users do, so we will need your deep understanding of modern monitoring and toolsets for identification, investigation, and remediation.

  • A self-starter with a passion for technology and problem solving and excellent analytical skills, who thrives in a fast-paced, autonomous environment
  • Comfortable with core infrastructure services like messaging, DNS and public internet-facing layer 7 application delivery and load balancing, API gateways and encryption
  • Solid experience in scripting, tooling, automation, and Infrastructure as Code (IaC) frameworks
  • In-depth knowledge and experience of monitoring and logging platforms; Zabbix and Graylog a bonus
  • Excellent understanding of traditional ops areas in Windows/Linux stacks and networking
  • Familiarity with container ecosystems including technologies such as Kubernetes, Docker
  • Quick to spot opportunities and new capabilities in technologies
  • A champion of resilience, consistency, security, and monitoring
  • Familiar with relational and NoSQL database technologies
  • Comfortable in complex provisioning and deployment scenarios and modern CI/CD pipelines
  • A strong, organised collaborator and a safe pair of hands
  • A team player who enjoys influencing change, leading and developing a technical function
  • Comfortable communicating to mixed audiences of Support, Product Delivery, Engineering, and Incident Management
  • Critical thinking skills, a delivery-oriented attitude, and attention to detail
  • At least 3 years’ experience building, deploying and maintaining production software

What we value

Passion, Teamwork, Innovation and Professionalism are the values we believe make us the company we are. We are looking for someone who understands great culture and will help us shape it as it evolves.
