Site Reliability Engineer

Job Descriptions

A site reliability engineer (SRE) is responsible for the reliability, scalability, and performance of a company's technological systems and infrastructure. SRE involves the development and implementation of processes and tools to ensure that systems are available, performant, and secure. The technical skills and attention to detail of a site reliability engineer are crucial in maintaining the smooth functioning and reliability of all systems, leading to increased efficiency and productivity for the organization. A clear and thorough job description is important in attracting the most qualified and capable candidates for this role, as the success of the company depends on the effectiveness of its site reliability engineering team.


Site Reliability Engineer

December 27, 2022


We are looking for a Site Reliability Engineer to join our team. The ideal candidate has a passion for reliability, automation, and scalability, along with experience in CI/CD pipelines, distributed computing, and databases. Your main responsibilities will include developing and maintaining reliable systems, monitoring performance and improving scalability, and supporting the development team by providing technical expertise. You will be responsible for the health and performance of our systems and infrastructure, and you will be expected to identify, diagnose, and proactively address issues. Additionally, you will be expected to develop and maintain automation for our services and operations. This is an exciting opportunity to join an innovative team and build high-quality, reliable systems.


• As a Site Reliability Engineer at Rezi, develop and maintain the tools, processes, and technologies needed to keep our systems running reliably and efficiently

• Develop and maintain monitoring systems to ensure services are running and performing as expected

• Diagnose and resolve production issues quickly to minimize impact to services

• Proactively identify and address potential system failures before they occur

• Automate tasks to reduce manual work, improve system reliability and reduce operational costs

• Design and implement solutions to improve system performance, scalability and security

• Collaborate with software engineers and other teams to ensure that services are designed to meet operational requirements and best practices

• Work with developers to ensure that applications are deployed and configured for optimal performance and reliability


• 5+ years of experience in software engineering, DevOps, and/or Site Reliability Engineering

• Proficient in at least one scripting language (Python, Bash, etc.) and one programming language (Go, Java, etc.)

• Knowledge of orchestration tools (Kubernetes, DC/OS, Swarm, etc.) and their internals

• Experience with automation/configuration management (Terraform, Ansible, Chef, Puppet, etc.)

• Ability to debug and optimize code and automate routine tasks

• Experience with monitoring and alerting tools (Prometheus, Grafana, etc.)

• Hands-on experience with containerization (Docker, rkt, etc.)

• Knowledge of networking protocols and concepts (TCP/IP, DNS, HTTP, etc.)

• Experience with cloud computing services (AWS, GCP, etc.)

• Strong knowledge of Linux/Unix operating system internals



Site Reliability Engineer

December 27, 2022


We are looking for a Site Reliability Engineer to join our team. The successful candidate will be responsible for managing and troubleshooting our production systems and ensuring optimal performance. This is a critical role that requires excellent problem-solving skills and a deep understanding of CI/CD pipelines, distributed computing, and database technologies. You will be responsible for developing and maintaining automated processes to monitor the health and performance of our systems, as well as using software to detect and diagnose system outages. Additionally, you will collaborate with other engineers and technical teams to ensure our systems are running smoothly and securely.

To be successful in this role, you must have experience implementing and maintaining CI/CD pipelines, strong knowledge of distributed computing principles, and a deep understanding of database technologies. You should also be able to troubleshoot issues, identify potential risks, and develop strategies to mitigate them. Above all, you should be a problem solver with a passion for automation and a commitment to excellence.


• As a Site Reliability Engineer at Rezi, develop software to automate and optimize operations and development

• Design and implement automated infrastructure, tools, and processes for testing, deploying, and monitoring software

• Develop and maintain monitoring, alerting and logging systems to ensure high availability of services

• Design and implement disaster recovery plans, and ensure reliability and scalability of services

• Coordinate with development and product teams to design and implement system architecture and best practices

• Develop and maintain automation scripts to manage and configure system resources

• Analyze and troubleshoot performance issues, identify root causes, and resolve problems

• Collaborate with software engineering teams to ensure optimal system performance and scalability


• 5+ years of experience in software engineering, systems administration, or a related field

• Proficient in one or more scripting languages (e.g. Python, Bash)

• Experience with distributed systems, microservices, and cloud architecture

• Expertise in networking and system architecture

• Ability to troubleshoot and debug complex distributed systems

• Experience with configuration management and automation tools (e.g. Ansible, Chef, Puppet)

• Experience with containerization and orchestration (e.g. Docker, Kubernetes)

• Experience with monitoring/logging tools (e.g. Prometheus, Grafana, ELK)

• Experience with CI/CD pipelines

• Knowledge of database systems (e.g. PostgreSQL, Cassandra)

• Familiarity with security best practices

• Excellent problem-solving and communication skills

