Why we need you:

As a Senior Site Reliability Engineer, you will work as part of the team that manages and delivers monitoring and observability services across our production and pre-production systems.

Your responsibilities will include:
 

  • System design, configuration, integration, deployment, and operations of Observability systems and tools. These systems collect metrics, logs, and events from gaming services, applications (client, middleware, backend) and infrastructure (AWS, on-premise). Together these Observability systems and tools serve as a critical part of PokerStars operations services
  • Design and deploy our Observability infrastructure and systems, taking them to the next level of availability and scale
  • Ensure our Observability platform exceeds goals for availability, capacity, efficiency, scalability, and performance
  • Develop metrics and log ingestion pipelines for high volumes of telemetry
  • Create build and deployment pipelines for monitoring tools
  • Deploy monitoring solutions into AWS, development and production environments
  • Develop a set of alerts and metrics to keep your own services alive and performing well
  • Collaborate with other SRE team members to improve the efficiency and reliability of monitoring solutions
  • Collaborate with our Application Development teams to define the standards/APIs that ensure our Applications are emitting the right telemetry (metrics, logs, traces, events)
  • Collect, aggregate and visualize metrics to provide visibility into, and standards for, the key indicators of the health of our most critical systems
  • Develop software to analyse real-time metrics feeds and produce actionable insight. In the longer term, move towards machine learning to surface anomalies automatically
  • Migrate Observability tools to Kubernetes
  • Evaluate, choose, and implement the next generation of Observability tools

Who we are looking for:

As a Senior SRE Observability Engineer, you have extensive working experience building, integrating, and administering systems that leverage open-source monitoring tools at scale (e.g., InfluxDB/TICK Stack, Prometheus), Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) and Grafana. Some of your experience is focused on coding and scripting (mostly Python, Java and Bash). You have developed metrics and log ingestion pipelines for high volumes of telemetry. We work with Atlassian products (Jira, Confluence, Bitbucket Server), so experience with them is a plus.

We try to follow best practices for IT operations in an always-up, always-available service, but you are welcome to suggest improvements. Our environment is Agile, so it’ll be good if you have worked in such teams.

You are a quick learner who can rapidly absorb a lot of information about our in-house framework and systems. In this position you will need to show good soft skills and the ability to liaise with technical teams and product/business people. You can work under pressure whilst maintaining accuracy and attention to detail. As a team we are results-oriented and rely on good communication to achieve success.

As the ideal candidate, you will have the following qualifications and experience:
 

  • B.Sc. in Computer Science or similar
  • 4+ years' experience with Open-Source Monitoring & Observability tooling/integration
  • Time Series Databases (TSDB) - InfluxDB/TICK Stack, Prometheus
  • Elastic Stack (Elasticsearch, Logstash, Kibana, Beats)
  • Grafana
  • Full proficiency with Linux command line environment
  • Strong scripting in Python and Bash
  • Programming experience in Java; Golang is a big plus
  • Expertise in Configuration and Deployment Automation using Salt and/or Ansible
  • Monitoring protocols/frameworks – Prometheus/Influx line format, SNMP, JMX, Spring Boot Actuator
  • Building software using Jenkins and JFrog Artifactory
  • Git and versioning software
  • AWS Cloud services
  • Containerisation experience (Kubernetes and Docker)
  • Middleware (Tomcat, Kafka)
  • Experience with Consul, Vault, Terraform is a plus
  • Some familiarity with open Observability initiatives (e.g., OpenTracing, OpenCensus, OpenMetrics)
Is a Remote Job?
Hybrid (Remote with required office time)
Employment Type
Full time

PokerStars is part of Flutter Entertainment Plc, a global sports betting, gaming and entertainment provider headquartered in Dublin and a constituent of the FTSE 100 index of the London Stock Exchange, which...
