Site Reliability Engineer – Remote

Are you our missing ingredient?

The Zonal group are one of the UK’s largest technology providers to the hospitality industry.

Our products are used by over 11,000 pubs, restaurants and hotels. Customers include national brands like Pizza Express, JD Wetherspoons and All Bar One.

If you’ve booked a table or hotel room, ordered and paid for food and drinks, received loyalty offers, or downloaded your favourite hangout’s app, you will likely have used a Zonal product.

We are a family business with Scottish roots. We operate from our modern head office in Edinburgh, our division in Staffordshire, our Innovation Centre in Abingdon and our hotel management solutions base in Cardiff.

We offer remote working within the UK. We also reward our staff well with very competitive salaries, a generous holiday allocation and well-structured career development plans.

What you’ll do

The makeup of our systems is changing rapidly, and we need you to play a key part in driving this forward. We are moving towards a modern DevOps landscape with technologies like Docker, IaC and microservices.

Initially we are working with our own hosted data centre infrastructure; however, you will play a key role in our drive towards a future hybrid public-cloud position.

As a Site Reliability Engineer, you will combine your software and systems engineering expertise in platforms, system design, integration, deployment and ongoing maintenance to help build and run our industry-leading SaaS solution.

We want you to help us build a key new Site Reliability Engineering (SRE) function within Zonal, implementing best-practice monitoring, processes and tooling focussed on uptime, performance, and reduction of toil.

Responsibilities will include:

  • Build strong, collaborative relationships, acting as the glue between in-house customer-facing support and delivery teams, product management, and platform engineering (R&D) teams
  • Create frameworks to continually improve:
    • Observability through logging, monitoring, and alerting
    • Capacity analytics and demand management
    • Dashboards, internal and external status pages
    • CI / CD pipelines, release processes
    • Automation of manual processes, tooling and IaC including security checks and break-glass procedures
    • Ownership of some cross-cutting implementation like logs / metrics infrastructure
    • Team processes, driving technical debt down
    • Triage, response, and recovery times
    • Disaster Recovery models and planning
  • Reduce toil (work that is largely manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as our services grow), maximising engineering capacity
  • Bring expertise and a streetwise perspective to problem solving, reduction of complexity and reliability patterns
  • Participate in On-Call cover and Incident Response
  • Play a key role in change management and the delivery pipeline into production, ensuring safety, predictability, repeatability, and auditability of all build and deploy processes
  • Proactively manage delivery of key SLOs covering detection / early warning and self-healing
  • Own all monitoring and alerting applications, services and infrastructure
  • Act as a key stakeholder in reducing the technical debt of our products by contributing to the tech debt backlogs for our R&D teams

Who you are

You will have a background in deploying, managing, and operating mission-critical distributed SaaS systems, having spent at least some of your career as a member of a fast-paced product engineering, web operations and/or platform delivery team.

Ideally you will have a demonstrable track record of operating systems in hybrid datacentre/cloud infrastructures.

When things go wrong – and they will – we want to know about the problem before our users do, so we will need your deep understanding of modern monitoring and toolsets for identification, investigation, and remediation.

We need you to be a self-starter with a passion for technology and problem solving and excellent analytical skills, who thrives in a fast-paced, autonomous environment.

  • Comfortable with core infrastructure services like messaging, DNS, public internet-facing layer 7 application delivery and load balancing, API gateways and encryption
  • Solid experience in scripting, tooling, automation, and Infrastructure as Code (IaC) frameworks
  • In-depth knowledge and experience of monitoring and logging platforms; experience of Zabbix and Graylog is a bonus
  • Excellent understanding of traditional ops areas in Windows/Linux stacks and networking
  • Familiarity with container ecosystems, including technologies such as Kubernetes and Docker
  • Quick to spot opportunities and new capabilities in technologies
  • Familiar with relational and NoSQL database technologies
  • Comfortable with complex provisioning and deployment scenarios and modern CI/CD pipelines

We are going on an exciting journey and we need more like-minded travellers to help us get there!

If this sounds like you then we would love to hear from you!
