Job
- Level: Experienced
- Job Field: Software, Data
- Employment Type: Full Time
- Contract Type: Permanent employment
- Location: Munich
- Working Model: Onsite
Job Summary
In this role, you build the infrastructure for distributed training, deployment, and experimentation, using technologies such as Kubernetes and PyTorch to move ML models into production efficiently.
Your role in the team
- The AI Research Division of Agile Robots is looking for an ML Platform Engineer to build and operate the distributed training, deployment, and experimentation infrastructure that research, data, and robotics teams depend on to move models from prototype to production.
- Design and scale distributed training workflows for large models using tools such as PyTorch Distributed, DeepSpeed, and cluster schedulers like SLURM or Kubernetes.
- Build and maintain containerized ML environments that support reproducible experimentation and benchmarking.
- Develop and maintain CI/CD pipelines for machine learning systems to enable reliable testing, training, and deployment of models.
- Implement experiment tracking, model versioning, and reproducibility workflows using tools such as ClearML or Weights & Biases.
- Set up monitoring systems such as Prometheus and Grafana to track model performance and system health and detect drift in production.
- Work with research, data, and robotics teams to connect new models to robust production systems.
Our expectations of you
Education
- Degree in Computer Science, Software Engineering, or a related field, with professional experience building and operating ML or software infrastructure in production.
Qualifications
- Familiarity with Infrastructure-as-Code tools such as Terraform.
- Exposure to high-performance or distributed compute environments.
Experience
- Experience designing and operating distributed training systems on Kubernetes and Docker, using PyTorch Distributed, DeepSpeed, and schedulers such as SLURM.
- Experience building CI/CD pipelines that support reliable model testing, training, and deployment.
- Experience operating ML workloads on cloud infrastructure, preferably AWS.
- Hands-on experience with experiment tracking and model versioning using tools such as MLflow or Weights & Biases.
- Experience with monitoring and drift detection using tools such as Prometheus and Grafana.
- Python and system design skills, with experience building and operating ML systems beyond the prototype stage.
- Experience with large-scale or multimodal ML systems such as vision-language-action models.
- Experience with ML pipeline and orchestration tools.
What we offer
- A dynamic high-tech company backed by financial stability and world-class investors.
- Join an interdisciplinary, international team with 60+ different nationalities in a collaborative work environment.
- Lots of development opportunities in the context of our continued growth.
- Challenging tasks and impactful projects alongside experts that enable professional and personal growth.
- Corporate benefits program worth 100 € net per month, covering health, mobility, and learning.
- Modern office facilities with a rooftop terrace overlooking Munich, free drinks and fruit, and regular company events.
Benefits
Health, Fitness & Fun
Work-Life-Integration
This is your employer
Agile Robots SE
Agile Robots SE, founded by leading robotics researchers, focuses on the development of AI-controlled robots and has established itself as a pioneer in automation.
Description
- Company Type: Established Company
- Working Model: Onsite
- Industry: Electronics, Automation