Site Reliability Engineer Job Descriptions
Site Reliability Engineer
Description
We are looking for a Site Reliability Engineer to join our team. The ideal candidate has a passion for reliability, automation, and scalability, along with experience in CI/CD pipelines, distributed computing, and databases. Your main responsibilities will include developing and maintaining reliable systems, monitoring performance, improving scalability, and supporting the development team with technical expertise. You will be responsible for the health and performance of our systems and infrastructure, and you will be expected to identify, diagnose, and proactively address issues. You will also develop and maintain automation for our services and operations. This is an exciting opportunity to join an innovative team and build high-quality, reliable systems.
Responsibilities
• As a Site Reliability Engineer at Rezi, you will be responsible for developing and maintaining the tools, processes, and technologies needed to keep our systems running reliably and efficiently
• Develop and maintain monitoring systems to ensure services are running and performing as expected
• Diagnose and resolve production issues quickly to minimize impact on services
• Proactively identify and address potential system failures before they occur
• Automate tasks to reduce manual work, improve system reliability and reduce operational costs
• Design and implement solutions to improve system performance, scalability and security
• Collaborate with software engineers and other teams to ensure that services are designed to meet operational requirements and best practices
• Work with developers to ensure that applications are deployed and configured for optimal performance and reliability
Requirements
• 5+ years of experience in software engineering, DevOps, and/or Site Reliability Engineering
• Proficient in at least one scripting language (Python, Bash, etc.) and one programming language (Go, Java, etc.)
• Knowledge of orchestration tools (Kubernetes, DC/OS, Swarm, etc.) and their internals
• Experience with automation/configuration management (Terraform, Ansible, Chef, Puppet, etc.)
• Ability to debug and optimize code and automate routine tasks
• Experience with monitoring and alerting tools (Prometheus, Grafana, etc.)
• Hands-on experience with containerization (Docker, rkt, etc.)
• Knowledge of networking protocols and concepts (TCP/IP, DNS, HTTP, etc.)
• Experience with cloud computing services (AWS, GCP, etc.)
• Strong knowledge of Linux/Unix operating system internals
Site Reliability Engineer
Description
We are looking for a Site Reliability Engineer to join our team. The successful candidate will be responsible for managing and troubleshooting our production systems and ensuring optimal performance. This is a critical role that requires excellent problem-solving skills and a deep understanding of CI/CD pipelines, distributed computing, and database technologies. You will be responsible for developing and maintaining automated processes to monitor the health and performance of our systems, as well as using software to detect and diagnose system outages. Additionally, you will collaborate with other engineers and technical teams to ensure our systems are running smoothly and securely.
To be successful in this role, you must have experience implementing and maintaining CI/CD pipelines, strong knowledge of distributed computing principles, and a deep understanding of database technologies. You should also be able to identify potential risks and develop strategies to mitigate them. Above all, you should be a problem solver with a passion for automation and a commitment to excellence.
Responsibilities
• As a Site Reliability Engineer at Rezi, you will be responsible for developing software to automate and optimize operations and development
• Design and implement automated infrastructure, tools, and processes for testing, deploying, and monitoring software
• Develop and maintain monitoring, alerting and logging systems to ensure high availability of services
• Design and implement disaster recovery plans, and ensure reliability and scalability of services
• Coordinate with development and product teams to design and implement system architecture and best practices
• Develop and maintain automation scripts to manage and configure system resources
• Analyze and troubleshoot performance issues, identify root causes, and resolve problems
• Collaborate with software engineering teams to ensure optimal system performance and scalability
Requirements
• 5+ years of experience in software engineering, systems administration, or a related field
• Proficient in one or more scripting languages (e.g. Python, Bash)
• Experience with distributed systems, microservices, and cloud architecture
• Expertise in networking and system architecture
• Ability to troubleshoot and debug complex distributed systems
• Experience with configuration management and automation tools (e.g. Ansible, Chef, Puppet)
• Experience with containerization and orchestration (e.g. Docker, Kubernetes)
• Experience with monitoring/logging tools (e.g. Prometheus, Grafana, ELK)
• Experience with CI/CD pipelines
• Knowledge of database systems (e.g. PostgreSQL, Cassandra)
• Familiarity with security best practices
• Excellent problem-solving and communication skills