Staff AI Engineer, Inference & Optimization
About Sonatus
Headquartered in Sunnyvale, CA, with 250+ employees worldwide, Sonatus combines the agility of a fast-growing company with the scale and impact of an established partner. Backed by strong funding and proven by global deployment, we’re solving some of the most interesting and complex challenges in the industry. Join us and help redefine what’s possible as we shape the future of mobility.
About the Role
Sonatus is seeking a Staff AI Engineer to optimize, deploy, and operate machine learning models for high-performance inference on edge and in-vehicle hardware.
Qualifications
Proven experience with inference optimization techniques such as quantization (INT8, FP16), pruning, and model distillation.
Deep hands-on experience with hardware acceleration for machine learning, including familiarity with GPUs, TPUs, NPUs, and related software ecosystems.
Strong experience with AI compilers and runtime environments like TensorRT, OpenVINO, and TVM.
Proven experience deploying and managing ML models on edge devices (e.g., NVIDIA Jetson, Raspberry Pi, NXP, Renesas).
Strong experience in designing and building distributed systems, including proficiency with RPC frameworks such as gRPC, messaging protocols such as MQTT, and efficient data-handling techniques such as buffering and callbacks.
Hands-on experience with popular ML frameworks such as PyTorch, TensorFlow, TFLite, and ONNX.
Proficiency in programming languages, including Python and C++.
Solid understanding of machine learning concepts, the ML development lifecycle, and the challenges of deploying models at scale.
Proficiency with containerization and orchestration technologies (Docker, Kubernetes) and cloud platforms (AWS, Azure).
Expertise in CI/CD principles and tools applied to machine learning workflows.
Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related quantitative field.
Responsibilities
Collaborate with researchers and hardware engineers to optimize models for performance, latency, and power consumption on specific hardware, including GPUs, TPUs, NPUs, and FPGAs. This includes a strong focus on inference optimization techniques like quantization, pruning, and knowledge distillation.
Use AI compilers and specialized software stacks (e.g., TensorRT, OpenVINO, TVM) to accelerate model execution, ensuring models are compiled and optimized for peak performance on target hardware.
Design, build, and maintain MLOps pipelines for deploying models to various edge devices (e.g., highly integrated vehicle compute), with a specific focus on performance and efficiency constraints.
Implement and maintain monitoring and alerting systems to track model performance, data drift, and overall model health in production.
Work with cloud platforms and on-device environments to provision and manage the necessary infrastructure for scalable and reliable model serving.
Proactively identify and resolve issues related to model performance, deployment failures, and data discrepancies, with a specific focus on inference bottlenecks.
Work closely with Machine Learning Engineers, Software Engineers, and Product Managers to bring models from design to high-performance production systems.
Benefits
Health care plan (Medical, Dental & Vision).
Retirement plan (401k, IRA).
Life Insurance (Basic, Voluntary & AD&D).
Unlimited paid time off (Vacation, Sick & Public Holidays).
Family leave (Maternity, Paternity).
Flexible work arrangements.
Free food & snacks in office.