Custom Japanese SLM MLOps Pipeline Development

Global Consulting Firm | Internal Project

Overview

Designed and built an end-to-end MLOps infrastructure based on the NVIDIA Data Flywheel Blueprint, enabling autonomous learning and improvement cycles for distilled models using production prompt/response logs. Using the NeMo Microservices platform, implemented a model optimization pipeline capable of reducing inference costs by up to 98.6%.

Architecture

Managed GPU resources efficiently by starting and stopping NIMs on demand with NeMo Deployment Manager. Ran background tasks on Celery workers, with Redis managing the task queue.
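The start/stop pattern for NIMs can be illustrated with a context manager that guarantees a model is stopped once the work that needs it has finished. This is a minimal sketch only: DeploymentClient and InMemoryClient are hypothetical stand-ins invented for illustration and do not reflect the actual NeMo Deployment Manager API.

```python
from contextlib import contextmanager
from typing import Iterator, Protocol


class DeploymentClient(Protocol):
    """Hypothetical interface; not the real NeMo Deployment Manager API."""

    def start(self, model: str) -> None: ...
    def stop(self, model: str) -> None: ...


class InMemoryClient:
    """Stand-in client that only records which models are running."""

    def __init__(self) -> None:
        self.running: set[str] = set()
        self.events: list[tuple[str, str]] = []

    def start(self, model: str) -> None:
        self.running.add(model)
        self.events.append(("start", model))

    def stop(self, model: str) -> None:
        self.running.discard(model)
        self.events.append(("stop", model))


@contextmanager
def deployed(client: DeploymentClient, model: str) -> Iterator[str]:
    """Start a model for the duration of a block and always stop it after."""
    client.start(model)
    try:
        yield model
    finally:
        # Runs on success and on error, so the GPU is released either way.
        client.stop(model)
```

Wrapping each evaluation or fine-tuning job in a block such as "with deployed(client, model_name):" keeps a GPU occupied only while a job is actually using it, including when the job fails.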

Prompt/response log collection and storage via Elasticsearch
Task classification and deduplication based on workload_id
Automatic generation of evaluation and training datasets using class-aware stratified splitting
Fine-tuning with NeMo Customizer (LoRA)
Evaluation with NeMo Evaluator using LLM-as-judge comparisons
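The deduplication and class-aware stratified splitting steps above can be sketched in plain Python. This is an illustrative sketch, not the project's implementation: the record fields "prompt" and "task_class", the choice of (workload_id, prompt) as the deduplication key, and the 20% evaluation fraction are all assumptions.

```python
import random
from collections import defaultdict


def dedupe(records):
    """Keep the first record for each (workload_id, prompt) pair (assumed key)."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["workload_id"], rec["prompt"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def stratified_split(records, label_key="task_class", eval_fraction=0.2, seed=0):
    """Split records into (train, eval) while preserving class proportions."""
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[label_key]].append(rec)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, eval_set = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        if len(group) >= 2:
            # Keep at least one record on each side of the split.
            n_eval = round(len(group) * eval_fraction)
            n_eval = min(max(n_eval, 1), len(group) - 1)
        else:
            # A single-record class goes entirely to training.
            n_eval = 0
        eval_set.extend(group[:n_eval])
        train.extend(group[n_eval:])
    return train, eval_set
```

Splitting per class rather than over the whole log keeps rare task types represented in both the training and evaluation sets.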

Experiment Types

Scored each experiment with an LLM-as-judge similarity metric in the range [0, 1] and automatically extracted high-scoring small models as candidates.

Base: Replay of production prompts
ICL (In-Context Learning): Few-shot prompt construction from traffic examples (random selection or semantic similarity-based)
Customized: Base prompt evaluation after LoRA fine-tuning
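The two ICL example-selection strategies can be sketched as follows. This is a sketch under stated assumptions: character-bigram cosine similarity is a crude stand-in for a real embedding model (chosen here because it needs no word segmentation, which suits Japanese text), and the function names and the "Input:/Output:" prompt layout are hypothetical.

```python
import math
import random
from collections import Counter


def bigrams(text):
    """Character-bigram counts; a crude stand-in for a real embedding."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k=2, strategy="semantic", seed=0):
    """Pick k (prompt, response) pairs from pool, randomly or by similarity."""
    if strategy == "random":
        return random.Random(seed).sample(pool, min(k, len(pool)))
    q = bigrams(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bigrams(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Prepend the selected examples to the live prompt."""
    shots = "\n\n".join(f"Input: {p}\nOutput: {r}" for p, r in examples)
    return f"{shots}\n\nInput: {query}\nOutput:"
```

Random selection is the cheaper baseline; similarity-based selection costs a lookup per request but tends to surface examples closer to the incoming prompt.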

Infrastructure

Built container orchestration with Kubernetes/Helm in an environment equipped with H100 and A100 GPUs. Built the data layer on a FastAPI-based REST API backed by MongoDB (metadata), Elasticsearch (logs), and Redis (task queue).

Built Kibana dashboards to visualize input data quality, making it possible to evaluate data and models quantitatively and to drive continuous improvement cycles for both. Improved development efficiency by introducing uv and Brev on Google Cloud.

Technologies

Python, NVIDIA NeMo, NVIDIA NIM, Kubernetes, Helm, FastAPI, Celery, Elasticsearch, MongoDB, Redis, Kibana, LoRA, H100/A100, Google Cloud, uv