Principal Engineer - ML Ops
Date: 21 Nov 2025
Location: Abu Dhabi, AE
Company: EDGE Group PJSC
About KATIM
KATIM is a leader in the development of innovative secure communication products and solutions for governments and businesses. As part of the Space & Cyber Technologies cluster at EDGE, one of the world’s leading advanced technology groups, KATIM delivers trust in a world where cyber risks are a constant threat, and fulfils the increasing demand for advanced cyber capabilities by delivering robust, secure, end-to-end solutions centered on four core business units: Networks, Ultra Secure Mobile Devices, Applications, and Satellite Communications. Our talented team of cross-functional experts continually takes on new challenges. We work with the energy of a start-up yet the discipline of a large business to make solutions and products work for our customers at scale.
Job Purpose (specific to this role)
The Principal MLOps Engineer / AI Infrastructure Architect defines and leads the design, automation, and governance of KATIM’s AI infrastructure, which powers mission-critical, secure communications products. The role drives end-to-end MLOps strategy, from model governance and deployment automation to compliance enforcement, ensuring every AI capability adheres to zero-trust and sovereign-data principles. It bridges applied machine learning, software engineering, and DevSecOps so that AI models, pipelines, and applications are deployed, monitored, and scaled securely across environments.
You will operate in an AI-augmented product development environment, where infrastructure provisioning, model training, deployment, and monitoring are automated or AI-assisted. Your mission is to build systems that empower product teams to deliver production-grade AI capabilities with the speed, reliability, and compliance standards required by government and enterprise clients.
You will collaborate closely with cross-functional teams across platform engineering, product security, and product management to design MLOps pipelines that embody resilience, traceability, and zero-trust security principles. This role is ideal for engineers passionate about building intelligent automation frameworks, scalable AI pipelines, and secure data infrastructures that form the foundation of mission-critical AI systems.
AI-Augmented Product Development Model (Context for the Role)
We are a lean, AI-augmented product development team delivering production-ready, enterprise-grade secure applications with the quality and velocity that would typically require an organization three to four times larger. Every role in the traditional SDLC is either AI-augmented or automated, allowing teams to focus on design, architecture, and high-impact decision-making.
Core Principles
• Security is integrated into every decision, from architecture to deployment.
• Repetitive tasks are automated; human effort is focused on strategy and problem-solving.
• Quality is measurable, enforced, and automated at every stage.
• All system behaviors—including AI-assisted outputs—must be traceable, reviewable, and explainable. We do not ship “black box” functionality.
• Design and implementation decisions prioritize resilience, maintainability, and interoperability in real customer environments.
Key Responsibilities
AI MLOps Architecture & Governance (30%)
• Define the MLOps architecture and governance framework across products.
• Design secure, scalable AI platform blueprints covering data, training, serving and monitoring layers.
• Standardize model registries, artifact signing, and deployment processes for air-gapped and on-prem environments.
• Lead architectural designs and reviews for AI pipelines.
• Design and maintain LLM inference infrastructure.
• Manage model registries and versioning (MLflow, Weights & Biases); a registry-promotion sketch follows this list.
• Implement model serving solutions (TensorFlow Serving, TorchServe, vLLM).
• Optimize model performance and cost (quantization, caching, batching).
• Build and maintain vector databases (Pinecone, Weaviate, Chroma).
• Maintain working awareness of hardware and inference optimization techniques.
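To make the registry and promotion workflow above concrete, here is a minimal sketch using MLflow's client API. It is illustrative only: the model name is a hypothetical placeholder, and a tracking server with a registry-capable backend is assumed to be configured.

# Minimal sketch: train a toy model, register it in the MLflow model
# registry, and promote the new version to Staging. The model name is
# hypothetical; a registry-backed tracking server is assumed.
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "katim-anomaly-detector"  # illustrative name, not a standard

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = LogisticRegression().fit(X, y)

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, artifact_path="model")
    version = mlflow.register_model(
        model_uri=f"runs:/{run.info.run_id}/model", name=MODEL_NAME
    )

client = MlflowClient()
client.transition_model_version_stage(
    name=MODEL_NAME, version=version.version, stage="Staging"
)

In an air-gapped setting, the promotion step above is a natural place to hook in artifact-signature verification before the stage transition.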
Agent & Tool Development (25%)
• Develop and maintain AI agents for automation (e.g., automated test generation, log intelligence, and anomaly detection).
• Build AI-assisted DevSecOps utilities to automatically enforce compliance, logging, and audit policies.
• Build tool integrations for LLM agents (function calling, APIs).
• Implement retrieval-augmented generation (RAG) pipelines (a RAG sketch follows this list).
• Create prompt management and versioning systems.
• Monitor and optimize agent performance.
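As an illustration of the RAG bullet above, the following minimal sketch indexes a toy corpus in an in-memory Chroma collection and retrieves context for a query; the corpus and the generate_answer() stub are hypothetical stand-ins for the team's real documents and LLM endpoint.

# Minimal RAG sketch: index documents in an in-memory Chroma collection,
# retrieve the most relevant passages, and pass them to an LLM stub.
# The corpus and generate_answer() are hypothetical placeholders.
import chromadb

client = chromadb.Client()  # in-memory instance, for illustration only
collection = client.create_collection(name="runbooks")

collection.add(
    documents=[
        "Restart the inference pod with a rolling restart of the deployment.",
        "Model artifacts must be signed before promotion to production.",
    ],
    ids=["doc-1", "doc-2"],
)

def generate_answer(question: str, context: list[str]) -> str:
    # Placeholder for a call to the deployed LLM (e.g., a vLLM endpoint)
    return f"[answer to {question!r} grounded in {len(context)} passages]"

question = "How do I restart the inference service?"
results = collection.query(query_texts=[question], n_results=2)
print(generate_answer(question, results["documents"][0]))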
CI/CT/CD Pipelines (20%)
• Build continuous integration pipelines for models and code.
• Implement continuous training (CT) workflows.
• Automate model deployment with rollback capabilities.
• Create staging and production deployment strategies.
• Integrate AI-assisted code review into CI/CD.
• Build a continuous evaluation loop (an evaluation-gate sketch follows this list).
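The continuous evaluation loop mentioned above can be as simple as a gate script in the pipeline. The sketch below, with assumed file paths, an assumed accuracy metric, and an assumed tolerance, fails the CI job when a candidate model regresses past the baseline.

# Minimal CI evaluation gate: compare a candidate model's metric against
# the stored baseline and block deployment on regression. The paths,
# metric name, and tolerance are illustrative assumptions.
import json
import sys
from pathlib import Path

BASELINE = Path("metrics/baseline.json")   # hypothetical artifact paths
CANDIDATE = Path("metrics/candidate.json")
TOLERANCE = 0.01                           # allowed accuracy regression

baseline = json.loads(BASELINE.read_text())["accuracy"]
candidate = json.loads(CANDIDATE.read_text())["accuracy"]

if candidate < baseline - TOLERANCE:
    print(f"FAIL: accuracy {candidate:.3f} vs baseline {baseline:.3f}")
    sys.exit(1)  # non-zero exit status fails the CI stage
print(f"PASS: accuracy {candidate:.3f} vs baseline {baseline:.3f}")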
Infrastructure & Automation (15%)
• Manage cloud infrastructure (Kubernetes, serverless).
• Implement Infrastructure as Code (Terraform, Pulumi).
• Build monitoring and observability systems (Prometheus, Grafana, Datadog); an instrumentation sketch follows this list.
• Automate operational tasks with AI agents.
• Ensure security and compliance (OWASP, SOC 2), including AI-specific security controls.
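For the instrumentation bullet, a minimal sketch with the official prometheus_client library is shown below; the metric names and the simulated workload are assumptions for illustration.

# Minimal sketch: expose request counts and latency to Prometheus.
# Metric names and the simulated workload are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("inference_requests_total", "Total inference requests")
LATENCY = Histogram("inference_latency_seconds", "Inference latency")

@LATENCY.time()  # records each call's duration into the histogram
def handle_request() -> None:
    REQUESTS.inc()
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for model inference

if __name__ == "__main__":
    start_http_server(8000)  # metrics scrapeable at :8000/metrics
    while True:
        handle_request()

Grafana dashboards and alert rules would then be driven from these series.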
Developer Enablement (10%)
• Provide tools and libraries for engineers to adopt AI-augmented workflows securely.
• Document AI/ML best practices and patterns.
• Conduct training on MLOps tools and workflows.
• Support engineers with AI integration challenges.
• Maintain development environment parity.
• Champion AI privacy, governance, and compliance practices across product teams.
Education and Minimum Qualification
• BS in Computer Science, Computer Engineering, Mathematics, or a related field, or equivalent industry experience; Master's degree preferred.
• 8+ years in DevOps, SRE, or platform engineering
• 5+ years hands-on experience with ML/AI systems in production
• Deep understanding of LLMs and their operational requirements
• Experience building and maintaining CI/CD pipelines
• Strong Linux/Unix systems knowledge
• Cloud platform expertise (AWS, GCP, or Azure)
• Experience with container orchestration (Kubernetes)
Key Skills
MLOps & AI:
• LLM Integration: OpenAI API, Anthropic API, HuggingFace, Azure OpenAI
• Model Serving: TensorFlow Serving, TorchServe, vLLM, Ollama
• Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
• Model Registries: MLflow, Kubeflow, AWS SageMaker
• Vector Databases: Pinecone, Weaviate, Chroma, Milvus
• Agent Frameworks: LangChain, LlamaIndex, AutoGPT, Semantic Kernel
• Fine-tuning: LoRA, QLoRA, prompt tuning
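As one concrete example from this list, attaching LoRA adapters with Hugging Face's peft library looks roughly like the sketch below; the base model and hyperparameters are illustrative choices, not project standards.

# Minimal LoRA setup sketch with Hugging Face peft. Base model and
# hyperparameters are illustrative, not prescribed values.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small demo model

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapters are trainable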
Data Engineering:
• Pipelines: Airflow, Prefect, Dagster (a DAG sketch follows this list)
• Processing: Spark, Dask, Ray
• Streaming: Kafka, Pulsar, Kinesis
• Data Quality: Great Expectations, dbt
• Feature Stores: Feast, Tecton
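To ground the pipelines bullet, a minimal Airflow DAG might look like the sketch below; the task bodies and schedule are assumptions (Airflow 2.x is assumed), and the same shape translates to Prefect or Dagster.

# Minimal Airflow 2.x DAG sketch: a daily two-step training-data pipeline.
# Task bodies and the schedule are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract() -> None:
    print("pull raw events from the source system")

def validate() -> None:
    print("run data-quality checks before training")

with DAG(
    dag_id="training_data_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
):
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    validate_task = PythonOperator(task_id="validate", python_callable=validate)
    extract_task >> validate_task  # validate runs only after extract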
DevOps & Infrastructure:
• Containers: Docker, Kubernetes, Helm
• Cloud Platforms: AWS (SageMaker, Lambda, ECS), GCP (Vertex AI, Cloud Run), or Azure (ML Studio)
• IaC: Terraform, Pulumi, CloudFormation
• CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
• Orchestration: Kubernetes operators, Kubeflow
Monitoring & Observability:
• Metrics: Prometheus, Grafana, CloudWatch
• Logging: ELK Stack, Loki, CloudWatch Logs
• Tracing: Jaeger, Zipkin, OpenTelemetry
• Alerting: PagerDuty, Opsgenie
• Model Monitoring: Arize, Fiddler, Evidently
Programming:
• Python: Primary language for ML/AI
• Libraries: NumPy, Pandas, PyTorch/TensorFlow, scikit-learn
• Serving: FastAPI, Flask (a minimal endpoint sketch follows this list)
• Go: For high-performance services and tooling
• Scripting: Bash and Python for automation
• SQL: Advanced queries, optimization
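Tying the serving bullet to code, a minimal FastAPI endpoint is sketched below; the scoring logic is a hypothetical stand-in for a real model call.

# Minimal FastAPI serving sketch. The "model" is a placeholder function;
# swap in a real predictor behind the same request/response contract.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    score = (len(req.text) % 10) / 10.0  # placeholder scoring logic
    return {"score": score}

# Run locally with: uvicorn main:app --port 8080  (module name assumed)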
AI-Assisted Operations:
• Autonomous agents for incident response
• AI-powered log analysis and anomaly detection
• Automated root cause analysis
• Intelligent alerting and noise reduction
Other Highly Desirable Skills:
• Experience with LLM fine-tuning and deployment at scale
• Background in data engineering or ML engineering
• Startup or high-growth environment experience
• Security certifications (CISSP, AWS Security)
• Contributions to open source MLOps projects
• Experience with multi-cloud or hybrid cloud
• Prior software engineering experience
Success Metrics
• Uptime: 99.9%+ availability for AI services
• Deployment Frequency: Daily or on-demand deployments
• Model Performance: Latency (p95 < 500 ms; a percentile spot-check follows this list), accuracy tracking
• Cost Efficiency: Cost per inference, infrastructure utilization
• Developer Velocity: Time to deploy new models, AI feature adoption rate
• Incident Response: MTTD (Mean Time to Detect), MTTR (Mean Time to Resolve)
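For reference, the p95 latency target is the 95th percentile of observed request latencies; a quick spot-check from collected samples, with made-up values, looks like this:

# Spot-check the p95 latency target from a batch of samples.
# The millisecond values are fabricated for illustration.
import numpy as np

latencies_ms = np.array([120, 180, 240, 310, 450, 95, 130, 520, 205, 160])
p95 = np.percentile(latencies_ms, 95)
print(f"p95 = {p95:.1f} ms ({'within' if p95 < 500 else 'breaching'} the 500 ms target)")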