Job Description: Senior MLOps Engineer

Trivandrum

Brief Description

We are building a strong DevOps, MLOps, and DataOps team to test, validate, and deploy production-grade Computer Vision and AI pipelines. The team's mission is to handle pipelines running on a wide range of hardware, from low-power edge devices such as Jetson Orin and Hailo-8/15 to multi-GPU data-center nodes. As a member of this team, you will need to master DevOps, MLOps, and DataOps while maintaining hardware awareness. This is a senior-level position, and you will report to the CTO.

Purpose of the Role

The primary purpose of this role is to take the AI/ML/Computer-Vision artifacts produced by the core R&D group and guide them to production. This involves:
● Functional testing, including unit, integration, regression, bias, fairness, and explainability tests.
● Validation against model-specific and pipeline-specific KPIs such as mAP, precision-recall, perplexity, latency, and throughput.
● Owning the path to production, which includes containerization, CI/CD, cloud deployment, monitoring, auto-retraining, and decommissioning.

Role Focus and Time Allocation

● Focus: Your primary focus will be on architecture, cross-platform CI/CD, mentoring, and both cloud/data-center and edge deployment.
● Time Allocation:
  ○ Deployment & Ops: 70%
  ○ Testing: 30%

Responsibilities

The responsibilities for this senior role include, but are not limited to, the following areas, each with an expectation of high depth and rigor:
● Build & Packaging:
  ○ Edge: Work with TensorRT, ONNX, and Hailo NN-Converter.
  ○ Server / Data-Center: Handle Yocto / Ubuntu Core images.
  ○ Cross-Cutting: Manage size-constrained Docker/OCI containers and multi-arch containers (x86_64, ARM), package training and testing environments, and maintain Git-versioned artifacts and SBOMs.
● Functional & Model Validation:
  ○ Edge: Oversee model quantization and calibration for INT8.
  ○ Server / Data-Center: Implement statistical parity and fairness tests.
  ○ Cross-Cutting: Manage regression datasets in DVC and validate metrics like mAP@0.5 and precision-recall.
● Performance & Power Testing:
  ○ Edge: Conduct Jetson power profiling and thermal throttling tests.
  ○ Server / Data-Center: Run NCCL/NVLink bandwidth tests and Roofline and PCIe saturation analysis.
  ○ Cross-Cutting: Ensure adherence to latency SLOs and throughput targets.

Preferred Skills

We are looking for candidates with the following skills. While this list is ambitious, we encourage skill development on the job. Familiarity with performance testing and Docker/Kubernetes deployments is required.
● Modern C++ (17/20/23): Expertise in SIMD intrinsics, RAII, and zero-copy IPC.
● AI Runtimes – C++ APIs:
  ○ ONNX Runtime (C++): Experience with custom EPs, session options, and IO-binding.
  ○ TensorRT (C++): Knowledge of plugins, IBuilderConfig, dynamic shapes, and Polygraphy.
  ○ NVIDIA Triton: Ability to set up ensemble configs, BLS backends, and custom C++ backends.
  ○ DeepStream (C/C++): Capable of writing basic GStreamer pipelines.
● Edge & Embedded Tool-chain:
  ○ Jetson (L4T): Proficient with JetPack, cross-compilation, nvpmodel, and jtop.
  ○ Hailo-8/15: Experience with HEF compilation, PCIe driver, and the C++ API.
  ○ Intel Arc / OpenVINO: Skilled in using the OpenVINO C++ API, DLStreamer, and GPU plugin custom kernels.
● Build & Packaging (C++):
  ○ CMake: Can handle toolchain files, FetchContent, and presets.
  ○ Conan / vcpkg: Able to maintain a private registry and define build policies.
  ○ Cross-compilation: Proficient with aarch64-linux-gnu and Yocto SDK.
● Performance & Profiling:
  ○ CPU: Skilled with perf, VTune, and uProf.
  ○ GPU: Experienced with Nsight Systems/Compute, rocprof, and Intel GPA.
● Testing C++ Pipelines:
  ○ Unit / Integration: Proficient with GoogleTest + GMock, Catch2, and approval tests.
  ○ Regression & KPI: Can maintain golden datasets (DVC), understands various model and pipeline KPIs, and can write code for custom metrics in Python/C++.
● DataOps & Telemetry:
  ○ gRPC / ZeroMQ C++: Able to build an async server and work with protobuf.
  ○ Logging & Tracing: Experienced with spdlog, fmt, and OpenTelemetry C++.
● Security & Hardening:
  ○ Static analysis: Can run clang-tidy, cppcheck, and define CodeQL rules.
  ○ Runtime hardening: Understands and uses various compiler flags (clang++/g++) for security.

Experience

● 5+ years in C++ systems.
● 3+ years in Computer Vision runtime optimization.
● At least one shipped edge product.