Senior ML Ops Engineer

Description

Overview

As a Senior ML Ops Engineer at Mimecast, you will be a technical leader on the AI Enablement Platform (AIP) team, responsible for ensuring that machine learning models and AI agents are deployed, scaled, observed, and maintained reliably across production environments. The AI Enablement Platform serves billions of requests per month across multiple regions, powering AI-driven capabilities in email security, insider risk, data loss prevention, and collaboration security for Mimecast's Human Risk Management platform.

This role sits at the intersection of infrastructure engineering and machine learning. You will own the design and implementation of self-service deployment tooling, platform resilience and scaling infrastructure, and operational best practices that enable ML Engineers and Data Scientists to ship models and agents independently, with confidence. You will also be responsible for building and maintaining the developer platform that accelerates the work of ML practitioners across the organization.

This is a senior individual contributor role. You are expected to drive architectural decisions, mentor other engineers, define standards, and operate with a high degree of autonomy. You will collaborate closely with ML Engineers, Software Engineers, SRE, and Cloud Platform teams.

AI-First Engineering at Mimecast

Mimecast is an AI-First engineering organization. Our teams actively leverage AI-powered development tools across all facets of engineering, from code development to testing, documentation, and operations. We're looking for leaders who don't just use AI tools but champion their adoption and establish new ways of working.

Our AI leadership extends beyond how we build to what we build. Our Mihra AI agent delivers 7x faster threat response for customers, and we're recognized as "Agents of Change" in Human Risk Management. Engineers here work at the intersection of cutting-edge AI tooling and AI-powered security products that protect organizations worldwide.

What You’ll Do:

Self-Service Deployment Tooling: Design and build config-driven, validated workflows that enable ML Engineers to deploy models to AIP infrastructure without requiring hands-on ML Ops involvement for each release. This includes automated validation pipelines, standardized configuration schemas, endpoint provisioning, and de-risked rollout patterns (canary, blue-green, rollback).
Platform Resilience and Scaling: Own the reliability and scalability of ML inference infrastructure. Design and tune autoscaling policies against real production traffic patterns, implement rate limiting and backpressure mechanisms (HTTP 429 responses with Retry-After headers) at the API layer, and build request prioritization frameworks (real-time vs. batch) so the platform protects itself under load without manual intervention or consumer-side changes.
Observability and Monitoring: Develop and maintain the platform's observability stack (metrics, logging, tracing, alerting) so that monitoring is wired in by default for every deployed model and agent. Continuously monitor model performance, data drift, latency, error rates, and system health. Build dashboards and alerting that give both the AIP team and consuming teams visibility into their workloads. All team members, including leadership, participate in an on-call rotation (12-hour shifts).
CI/CD and Automation: Design, implement, and maintain robust CI/CD pipelines for ML model and infrastructure deployments. Automate testing (functional, integration, performance) as pre-deployment gates that ML Engineers can trigger themselves, with clear pass/fail criteria.
Infrastructure as Code: Manage all AIP infrastructure through Terraform and configuration management tooling. Maintain multi-region deployment capabilities and ensure infrastructure changes are reviewable, repeatable, and auditable.
Cost Optimization: Implement and enforce cost tagging and allocation at deployment time. Optimize ML inference endpoints for cost-effectiveness, including right-sizing instance types, managing reserved capacity, and providing opinionated endpoint configuration recommendations based on model characteristics.
Agent and LLM Operations: Support the deployment and operational management of AI agents and LLM-based capabilities within the AIP's templatized agent framework. This includes infrastructure for agent hosting, tool access configuration, and observability for agentic workloads.
Security and Compliance: Ensure ML systems adhere to security best practices, including input validation, authentication, network exposure controls, and automated security scanning for model configurations. Support compliance with regulatory requirements relevant to AI systems in the cybersecurity domain.
Technical Leadership: Mentor ML Ops and ML Engineers on operational best practices. Participate in architectural reviews, contribute to platform governance, and drive engineering standards through documentation, code reviews, and design discussions. Represent ML Ops in cross-functional planning with SRE, Cloud Platform, and consuming product teams.

What You’ll Bring:

Strong experience with AWS, particularly SageMaker (endpoints), EC2, ECS/EKS, SQS, S3, CloudWatch, and IAM.
Proficiency in Python, Java (Spring Boot), and Bash scripting.
Deep expertise with Infrastructure as Code (Terraform) and containerization (Docker, Kubernetes), including scaling policies, ConfigMaps, and resource management.
Strong experience designing and maintaining CI/CD pipelines (Jenkins or GitHub Actions preferred), including lifecycle management, automated testing, and deployment gates for ML workflows.
Demonstrated ability to build and tune autoscaling, rate limiting, and traffic management systems for high-throughput, latency-sensitive services.
Solid understanding of observability tooling and practices (Grafana, CloudWatch, OpenTelemetry) and experience building monitoring for ML model performance in production.
Familiarity with ML frameworks (PyTorch), ML lifecycle tools (MLflow, SageMaker), and model serving patterns (real-time inference, batch transform, async processing). Experience with Triton and ONNX preferred.
Experience working in multi-region, production-grade environments handling high request volumes.
Experience and enthusiasm using AI-assisted development tooling to accelerate your own work and the work of ML Engineers.
Comfort operating as a technical authority across ML Engineering, Software Engineering, SRE, and product teams, influencing outcomes through expertise and trust rather than org chart position.

Preferred Qualifications

Experience building self-service developer platforms or internal tooling for ML/data teams.
Exposure to LLM serving infrastructure (model hosting, prompt management, token-level observability) and agentic AI deployment patterns.
Experience with cost allocation, FinOps, or cloud cost optimization for ML workloads.
Background in cybersecurity or experience operating AI systems in regulated or security-sensitive environments.
Familiarity with the Model Context Protocol (MCP) or similar agent-tool integration standards.

What We Bring:

Join our AI Enablement Platform (AIP) team to accelerate your career journey, working with cutting-edge technologies and contributing to projects that have real customer impact. You will be immersed in a dynamic environment that recognizes and celebrates your achievements.

Mimecast offers formal and ‘on the job’ learning opportunities, maintains a comprehensive benefits package that helps our employees and their family members sustain a healthy lifestyle, and, importantly, gives you the chance to work in cross-functional teams to build your knowledge!

Our Hybrid Model: We give you the flexibility to live a balanced, healthy life through our hybrid working model, which champions both collaborative teamwork and individual flexibility. Employees are expected to come to the office at least two days per week, because working together in person:

Fosters a culture of collaboration, communication, performance and learning.
Drives innovation and creativity within and between teams.
Introduces employees to priorities outside of their immediate realm.
Strengthens important interpersonal relationships and connections with one another and our community.

The base salary range for this position is $148,000–$222,000 plus benefits. This range represents the minimum and maximum new hire compensation for this role. The position may also be eligible for incentive plans and additional benefits, in accordance with company policy and local regulations. Our salary ranges are determined by role, level, and location with individual compensation also dependent on factors such as qualifications, experience, and skills. Final offers will reflect these considerations and may vary accordingly.

Belonging at Mimecast

Cybersecurity is a community effort. That’s why we’re committed to building an inclusive, diverse community that celebrates and welcomes everyone – unless they’re a cybercriminal, of course.

We’re proud to be an Equal Opportunity and Affirmative Action Employer, and we’d encourage you to join us whatever your background. We particularly welcome applicants from traditionally underrepresented groups.

We consider everyone equally: your race, age, religion, sexual orientation, gender identity, ability, marital status, nationality, or any other protected characteristic won’t affect your application.

If you require any adjustments or accommodations for your interview process, whether due to a disability or for any other reason, please let us know by emailing careers@mimecast.com.

Due to certain obligations to our customers, an offer of employment will be subject to your successful completion of applicable background checks, conducted in accordance with local law.

It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment.