Site Reliability Engineer
Israel - CTO - Full-time - Intermediate
Glassbox is looking for a Site Reliability Engineer to join our Global Production DevOps team.
We are Glassbox, and our mission is to reveal the insights that empower organizations to deliver exceptional digital customer experiences.
We are growing and have been recognized by G2 as one of 2023's Top 100 Software Companies in the world.
Our customers are the best of the best and include six of the ten largest global banks, the world’s largest hotel chain, and the largest healthcare company and the largest telecommunications company in the U.S.
Now is the perfect time to come to Glassbox and help us accelerate our global leadership position!
If you are a dynamic, successful, experienced metrics-driven leader, Glassbox might be a great fit.
Will you join us on this journey?
SRE at Glassbox:
The SRE is responsible for the availability and operation of the Glassbox SaaS production environment in AWS and Azure.
The Glassbox production environment comprises thousands of servers and petabytes of storage.
Production runs on Kubernetes (K8s) and Docker, is built on cutting-edge technology, and serves many Fortune 500 customers.
What You Will Do
Availability:
- Provide technical/operational support for customers according to SLA.
- Add automation and context to alerts and prevent availability issues.
Performance:
- Perform proactive tasks on SaaS environments, including creating dashboards and gathering insights from them: fault detection, isolation, resolution, and root cause analysis when needed.
Monitoring:
- Define, create, and implement monitoring solutions
Incident Response:
- Build runbooks for NOC/SOC team.
- Act as tier 2 for incidents during working hours and take part in the 24x7 shift rotation.
- Conduct post-incident reviews.
Preparation:
- Operate and maintain the system through automation, using scripts and configuration management tools
- Work closely with the Cloud DevOps team to transition products from development to the production environment via continuous integration and deployment processes
- Operate CI/CD DevOps tools such as Git, Jenkins, Ansible, Terraform on AWS-/Azure-based SaaS production systems.
- Lead the deployment, maintenance, and management of mission-critical AWS-based SaaS production systems to ensure 24/7 availability, performance, and scalability
What You Will Need
- 2+ years of working with Linux operating systems
- 2+ years of working with the core AWS/Azure services
- 2+ years of SaaS operations management experience
- 2+ years of experience with enterprise Web applications and Cloud products
- Experience with operations of CI/CD DevOps tools such as Git, Jenkins, Ansible, and Terraform
- Experience with Docker and Kubernetes
- Experience with cloud monitoring and management tools (Datadog, PagerDuty, Prometheus, etc.)
- Comfortable speaking and writing in English (intermediate/advanced preferred)
- Strong communication and collaboration skills
Advantage
- Bachelor's degree in Computer Information Systems, Management Information Systems, Computer Science, or another related field, or equivalent field experience
- Experience writing scripts in various languages (Bash, Perl, Python)
- Assertive, confident, fast learner, and comfortable working in a fast-paced environment