Principal Engineer - ML Ops

Apply now »

Date: 21 Nov 2025

Location: Abu Dhabi, AE

Company: EDGE Group PJSC

About KATIM

KATIM is a leader in the development of innovative secure communication products and solutions for governments and businesses. As part of the Space & Cyber Technologies cluster at EDGE, one of the world’s leading advanced technology groups, KATIM delivers trust in a world where cyber risks are a constant threat, and fulfils the increasing demand for advanced cyber capabilities by delivering robust, secure, end-to-end solutions centered on four core business units: Networks, Ultra Secure Mobile Devices, Applications, and Satellite Communications. Our talented team of cross-functional experts continually takes on new challenges. We work with the energy of a start-up yet the discipline of a large business to make solutions and products work for our customers at scale.

 

Job Purpose (specific to this role)
The Principal MLOps Engineer / AI Infrastructure Architect defines and leads the design, automation, and governance of KATIM’s AI infrastructure powering mission-critical, secure communications products. The role drives end-to-end MLOps strategy, from model governance and deployment automation to compliance enforcement, ensuring every AI capability adheres to zero-trust and sovereign-data principles. It bridges applied machine learning, software engineering, and DevSecOps, ensuring that AI models, pipelines, and applications are deployed, monitored, and scaled securely across environments.

You will operate in an AI-augmented product development environment, where infrastructure provisioning, model training, deployment, and monitoring are automated or AI-assisted. Your mission is to build systems that empower product teams to deliver production-grade AI capabilities with the speed, reliability, and compliance standards required by government and enterprise clients.

You will collaborate closely with cross-functional teams across platform engineering, product security, and product management to design MLOps pipelines that embody resilience, traceability, and zero-trust security principles. This role is ideal for engineers passionate about building intelligent automation frameworks, scalable AI pipelines, and secure data infrastructures that form the foundation of mission-critical AI systems.

 

AI-Augmented Product Development Model (Context for the Role)
We are a lean, AI-augmented product development team delivering production-ready, enterprise-grade secure applications with the quality and velocity that would typically require an organization 3–4x our size. Every role in the traditional SDLC is either AI-augmented or automated, allowing teams to focus on design, architecture, and high-impact decision-making.

 

Core Principles
•    Security is integrated into every decision, from architecture to deployment.
•    Repetitive tasks are automated; human effort is focused on strategy and problem-solving.
•    Quality is measurable, enforced, and automated at every stage.
•    All system behaviors—including AI-assisted outputs—must be traceable, reviewable, and explainable. We do not ship “black box” functionality.
•    Design and implementation decisions prioritize resilience, maintainability, and interoperability in real customer environments.

 

Key Responsibilities

AI MLOps Architecture & Governance (30%)    

•    Define the MLOps architecture and governance framework across products.
•    Design secure, scalable AI platform blueprints covering data, training, serving, and monitoring layers.
•    Standardize model registries, artifact signing, and deployment processes for air-gapped and on-prem environments.
•    Lead architectural designs and reviews for AI pipelines.
•    Design and maintain LLM inference infrastructure.
•    Manage model registries and versioning (MLflow, Weights & Biases).
•    Implement model serving solutions (TensorFlow Serving, TorchServe, vLLM).
•    Optimize model performance and cost (quantization, caching, batching).
•    Build and maintain vector databases (Pinecone, Weaviate, Chroma).
•    Maintain awareness of hardware and inference optimization techniques.
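To make the registry and artifact-integrity responsibilities above concrete, here is a minimal, illustrative sketch (all names hypothetical, not KATIM tooling) of registering model versions with content digests that can be re-verified before deployment into an air-gapped environment. In production, a registry such as MLflow plus a real signing scheme would replace the bare SHA-256 digest:

```python
import hashlib

class ModelRegistry:
    """Toy in-memory model registry: versioned artifacts with content digests."""

    def __init__(self):
        self._models = {}  # name -> list of {"version", "digest", "artifact"}

    def register(self, name: str, artifact: bytes) -> int:
        """Register a new model version; returns the version number."""
        digest = hashlib.sha256(artifact).hexdigest()
        versions = self._models.setdefault(name, [])
        versions.append({"version": len(versions) + 1,
                         "digest": digest,
                         "artifact": artifact})
        return len(versions)

    def verify(self, name: str, version: int) -> bool:
        """Re-hash the stored artifact and compare against the recorded digest."""
        entry = self._models[name][version - 1]
        return hashlib.sha256(entry["artifact"]).hexdigest() == entry["digest"]

registry = ModelRegistry()
v = registry.register("anomaly-detector", b"serialized-model-bytes")
ok = registry.verify("anomaly-detector", v)
```

The same verify step would run inside the deployment pipeline, so a tampered or corrupted artifact is rejected before it reaches a serving environment.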

Agent & Tool Development (25%)    

•    Develop and maintain AI agents for automation (e.g., automated test generation, log intelligence, and anomaly detection).
•    Build AI-assisted DevSecOps utilities to automatically enforce compliance, logging, and audit policies.
•    Build tool integrations for LLM agents (function calling, APIs).
•    Implement retrieval-augmented generation (RAG) pipelines.
•    Create prompt management and versioning systems.
•    Monitor and optimize agent performance.
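As an illustration of the RAG responsibility, the core retrieval step can be sketched with toy, hand-assigned embeddings and cosine similarity. A production pipeline would use a real embedding model and a vector database such as Chroma or Weaviate; the documents and vectors below are invented for the example:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy document store: text paired with a pre-computed embedding vector.
docs = [
    ("KATIM builds secure communication products.", [0.9, 0.1, 0.0]),
    ("Kubernetes orchestrates containers.",         [0.1, 0.9, 0.1]),
    ("vLLM serves large language models.",          [0.0, 0.2, 0.9]),
]

def retrieve(query_embedding, k=1):
    """Return the k documents most similar to the query embedding."""
    ranked = sorted(docs, key=lambda d: cosine(query_embedding, d[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# A query embedding close to the "model serving" document.
context = retrieve([0.0, 0.1, 1.0], k=1)
# The retrieved context is then prepended to the LLM prompt (the "augmented" step).
```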

CI/CT/CD Pipelines (20%)
•    Build continuous integration pipelines for models and code
•    Implement continuous training (CT) workflows
•    Automate model deployment with rollback capabilities
•    Create staging and production deployment strategies
•    Integrate AI-assisted code review into CI/CD
•    Build a continuous evaluation loop
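One way to picture the continuous-evaluation loop and rollback capability listed above: each candidate model is scored on a held-out evaluation set and promoted only if it beats the incumbent, otherwise the pipeline keeps serving the production model. A deliberately tiny, framework-free sketch (real pipelines would use a metric suite and deployment tooling, not single callables):

```python
def evaluate(model, eval_set):
    """Fraction of eval examples the model gets right (stand-in for a real metric suite)."""
    return sum(1 for x, y in eval_set if model(x) == y) / len(eval_set)

def promote_or_rollback(production, candidate, eval_set, min_gain=0.0):
    """Return the model that should serve traffic after this evaluation round."""
    prod_score = evaluate(production, eval_set)
    cand_score = evaluate(candidate, eval_set)
    if cand_score > prod_score + min_gain:
        return candidate, "promoted"
    return production, "rolled back"

# Toy task: classify whether a number is positive.
eval_set = [(-2, False), (-1, False), (1, True), (2, True)]
current = lambda x: x >= -1   # mislabels -1, scores 0.75
candidate = lambda x: x > 0   # correct on all four, scores 1.0
serving, decision = promote_or_rollback(current, candidate, eval_set)
```

The `min_gain` threshold is where a team encodes its promotion policy, e.g. requiring a candidate to win by a margin before replacing a known-good production model.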

Infrastructure & Automation (15%)    

•    Manage cloud infrastructure (Kubernetes, serverless)
•    Implement Infrastructure as Code (Terraform, Pulumi)
•    Build monitoring and observability systems (Prometheus, Grafana, DataDog)
•    Automate operational tasks with AI agents
•    Ensure security and compliance (OWASP, SOC 2), including AI-specific security controls
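On the monitoring bullet: services typically expose metrics in the Prometheus text exposition format, which Prometheus scrapes over HTTP and Grafana visualizes. A hand-rolled sketch of that format is shown below purely for illustration (the metric name and labels are invented; in practice the official `prometheus_client` library handles this):

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}" if labels
                     else f"{name} {value}")
    return "\n".join(lines)

body = render_metric(
    "inference_requests_total",
    "Total inference requests served.",
    "counter",
    [({"model": "anomaly-detector", "status": "ok"}, 42)],
)
```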

Developer Enablement (10%)

•    Provide tools and libraries for engineers to adopt AI-augmented workflows securely.
•    Document AI/ML best practices and patterns.
•    Conduct training on MLOps tools and workflows.
•    Support engineers with AI integration challenges.
•    Maintain development environment parity.
•    Champion AI privacy, governance, and compliance practices.

 

Education and Minimum Qualification

•    BS in Computer Science, Computer Engineering, Mathematics, or a related field, or equivalent industry experience; Master's degree preferred.
•    8+ years in DevOps, SRE, or platform engineering
•    5+ years hands-on experience with ML/AI systems in production
•    Deep understanding of LLMs and their operational requirements
•    Experience building and maintaining CI/CD pipelines
•    Strong Linux/Unix systems knowledge
•    Cloud platform expertise (AWS, GCP, or Azure)
•    Experience with container orchestration (Kubernetes)

 

Key Skills

MLOps & AI:
•    LLM Integration: OpenAI API, Anthropic API, HuggingFace, Azure OpenAI
•    Model Serving: TensorFlow Serving, TorchServe, vLLM, Ollama
•    Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
•    Model Registries: MLflow, Kubeflow, AWS SageMaker
•    Vector Databases: Pinecone, Weaviate, Chroma, Milvus
•    Agent Frameworks: LangChain, LlamaIndex, AutoGPT, Semantic Kernel
•    Fine-tuning: LoRA, QLoRA, prompt tuning
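On the fine-tuning items above: the core idea of LoRA is to freeze the pre-trained weight matrix W and learn a low-rank update ΔW = (α/r)·A·B, which is cheap to train and can be merged back into W for serving. A deliberately tiny, dependency-free sketch of the merge step, with made-up 2x2 numbers (real fine-tuning would use a library such as PEFT on top of PyTorch):

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * A @ B, the merged weight used at inference time."""
    scale = alpha / r
    delta = matmul(A, B)
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))]
            for i in range(len(W))]

# 2x2 frozen weight, rank-1 adapters (r = 1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]   # 2x1
B = [[0.5, 0.5]]     # 1x2
merged = merge_lora(W, A, B, alpha=1.0, r=1)
```

Because only A and B are trained (here 4 numbers instead of 4 full weights, and the gap widens dramatically at real model sizes), adapters are small enough to version and ship per customer or per task.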

Data Engineering:
•    Pipelines: Airflow, Prefect, Dagster
•    Processing: Spark, Dask, Ray
•    Streaming: Kafka, Pulsar, Kinesis
•    Data Quality: Great Expectations, dbt
•    Feature Stores: Feast, Tecton

DevOps & Infrastructure:
•    Containers: Docker, Kubernetes, Helm
•    Cloud Platforms: AWS (SageMaker, Lambda, ECS) OR GCP (Vertex AI, Cloud Run) OR Azure (ML Studio)
•    IaC: Terraform, Pulumi, CloudFormation
•    CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
•    Orchestration: Kubernetes operators, Kubeflow

Monitoring & Observability:
•    Metrics: Prometheus, Grafana, CloudWatch
•    Logging: ELK Stack, Loki, CloudWatch Logs
•    Tracing: Jaeger, Zipkin, OpenTelemetry
•    Alerting: PagerDuty, Opsgenie
•    Model Monitoring: Arize, Fiddler, Evidently

Programming:
•    Python: Primary language for ML/AI
•    Libraries: NumPy, Pandas, PyTorch/TensorFlow, scikit-learn
•    FastAPI, Flask for serving
•    Go: For high-performance services and tooling
•    Shell Scripting: Bash, Python for automation
•    SQL: Advanced queries, optimization

AI-Assisted Operations:
•    Autonomous agents for incident response
•    AI-powered log analysis and anomaly detection
•    Automated root cause analysis
•    Intelligent alerting and noise reduction

Other Highly Desirable Skills:
•    Experience with LLM fine-tuning and deployment at scale
•    Background in data engineering or ML engineering
•    Startup or high-growth environment experience
•    Security certifications (CISSP, AWS Security)
•    Contributions to open source MLOps projects
•    Experience with multi-cloud or hybrid cloud
•    Prior software engineering experience

Success Metrics
•    Uptime: 99.9%+ availability for AI services
•    Deployment Frequency: Daily or on-demand deployments
•    Model Performance: Latency (p95 < 500ms), accuracy tracking
•    Cost Efficiency: Cost per inference, infrastructure utilization
•    Developer Velocity: Time to deploy new models, AI feature adoption rate
•    Incident Response: MTTD (Mean Time to Detect), MTTR (Mean Time to Resolve)
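A note on the latency target above: p95 < 500ms means 95% of requests complete in under 500ms, so it is computed from the sorted distribution of observed latencies, not the mean (which a single outlier can distort). A quick illustrative computation using the nearest-rank method, with invented sample data (monitoring stacks such as Prometheus estimate percentiles from histograms instead):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 20 request latencies in milliseconds; one slow outlier.
latencies_ms = [120, 135, 140, 150, 155, 160, 170, 175, 180, 190,
                200, 210, 220, 230, 250, 270, 300, 350, 420, 2500]
p95 = percentile(latencies_ms, 95)  # 19th of the 20 sorted values
mean = sum(latencies_ms) / len(latencies_ms)
```

Here the p95 (420ms) meets a 500ms target even though the outlier drags the mean far higher, which is exactly why tail percentiles, not averages, are used as SLO metrics.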

 

#KATIM


