Senior Engineer - Site Reliability Engineering - Emirati Talent
Date: 17 Sept 2025
Location: Abu Dhabi, AE
Company: EDGE Group PJSC
About KATIM
KATIM is a leader in the development of innovative secure communication products and solutions for governments and businesses. As part of the Electronic Warfare & Cyber Technologies cluster at EDGE, one of the world’s leading advanced technology groups, KATIM delivers trust in a world where cyber risks are a constant threat. It meets the increasing demand for advanced cyber capabilities by delivering robust, secure, end-to-end solutions centered on four core business units: Networks, Ultra Secure Mobile Devices, Applications, and Satellite Communications.
The Senior Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of mission-critical systems and services. This role combines software engineering and operations expertise to automate processes, optimize infrastructure, and reduce toil. Acting as a bridge between development and operations, the Senior SRE drives continuous improvement in availability, observability, and incident response, while mentoring junior team members and promoting a culture of reliability across the organization.
Key Responsibilities:
- Design, implement, and maintain highly available, scalable, and resilient infrastructure and services.
- Develop automation frameworks and tools to improve deployment, monitoring, and operational processes.
- Lead incident response and root cause analysis (RCA), and implement permanent fixes to improve system reliability.
- Collaborate with development and infrastructure teams to embed reliability and performance best practices into the product lifecycle.
- Define and monitor SLOs/SLAs to ensure service quality and client satisfaction.
- Drive capacity planning, performance tuning, and cost optimization initiatives.
- Mentor junior engineers and contribute to knowledge sharing, standards, and documentation.
- Stay current with industry trends and emerging technologies to propose innovative solutions.
Experience and Education:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 7–10 years of overall IT experience, with at least 4–5 years in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
- 3+ years of operations experience in a production, customer-facing environment.
- Hands-on experience managing large-scale distributed systems and production environments.
- Proven experience in incident management, performance tuning, and capacity planning.
- Strong expertise in Linux/Unix administration and scripting (Python, Bash, Go preferred).
- Proficiency with containerization and orchestration technologies (Docker, Kubernetes, Helm).
- Experience with cloud platforms (AWS, Azure, GCP) and with on-premises and hybrid environments.
- Knowledge of CI/CD pipelines and automation frameworks (Jenkins, GitLab CI, ArgoCD, Terraform, Ansible).
- Solid understanding of networking, security, and load balancing.
- Experience with observability stacks (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
- Database operations knowledge (PostgreSQL, MySQL, NoSQL).
Key Skills:
- Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
- Certified Kubernetes Administrator (CKA) or Certified Kubernetes Application Developer (CKAD) certification.
- Cloud certifications (AWS Solutions Architect, Azure Administrator, or GCP Professional Cloud Engineer).
- Proven track record of working in Agile/Scrum environments and using tools like Jira and Confluence.
- Exceptional communication and collaboration skills, with the ability to work effectively in cross-functional teams.
#KATIM
Job Segment:
Application Developer, Cloud, Computer Science, Developer, Solution Architect, Technology