
Custom Japanese SLM MLOps Pipeline Development

Global Consulting Firm | Internal Project

Overview

Designed and built end-to-end: forked NVIDIA AI Blueprints' Data Flywheel (`NVIDIA-AI-Blueprints/data-flywheel`) and extended it into an autonomous-learning MLOps stack for Japanese SLMs. Production prompt / response logs kick off a loop that auto-generates evaluation data, runs LoRA fine-tuning, evaluates with LLM-as-judge, and promotes candidate models, all on NeMo Microservices, reproducing NVIDIA's published benchmark of up to 98.6% inference cost reduction.
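The loop above can be sketched as follows. This is a hypothetical outline, not the pipeline's actual API: the function names, the `Candidate` type, and the promotion threshold are illustrative stand-ins for the NeMo Customizer / Evaluator steps.

```python
from dataclasses import dataclass

PROMOTE_THRESHOLD = 0.9  # assumed promotion cutoff, not the real value


@dataclass
class Candidate:
    model: str
    judge_score: float  # LLM-as-judge score in [0, 1]


def build_datasets(logs):
    # Stand-in for auto-generating eval / train sets from production logs.
    mid = len(logs) // 2
    return logs[:mid], logs[mid:]


def lora_finetune(base_model, train_set):
    # Stand-in for the NeMo Customizer LoRA fine-tuning job.
    return f"{base_model}+lora"


def judge_evaluate(model, eval_set):
    # Stand-in for paired LLM-as-judge evaluation; returns a [0, 1] score.
    return 0.95


def run_flywheel(logs, base_model):
    eval_set, train_set = build_datasets(logs)
    adapter = lora_finetune(base_model, train_set)
    score = judge_evaluate(adapter, eval_set)
    # Promote the candidate only when the judge score clears the threshold.
    if score >= PROMOTE_THRESHOLD:
        return Candidate(model=adapter, judge_score=score)
    return None
```

Each real stage is a long-running job on NeMo Microservices; the sketch only shows the control flow that decides whether a fine-tuned candidate gets promoted.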

Architecture

A FastAPI REST API receives jobs, and a Celery + Redis worker fleet processes them asynchronously, persisting results to MongoDB (metadata) and Elasticsearch (logs, inference traces). NIMs are spun up and torn down through NeMo Deployment Manager so GPUs are billed only during fine-tuning and evaluation.
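The intake flow reduces to a classic accept-then-drain pattern. The stdlib sketch below stands in for the real stack: a `queue.Queue` plays the Redis broker, a thread plays a Celery worker, and two in-memory objects play MongoDB and Elasticsearch; none of these names come from the actual codebase.

```python
import json
import queue
import threading

job_queue = queue.Queue()   # stand-in for the Redis broker
metadata_store = {}         # stand-in for MongoDB (job metadata)
trace_store = []            # stand-in for Elasticsearch (inference traces)


def submit_job(job_id, payload):
    """API-side handler: enqueue the job and return immediately."""
    job_queue.put((job_id, payload))
    return {"job_id": job_id, "status": "queued"}


def worker():
    """Celery-style worker: drain the queue, write to both stores."""
    while True:
        item = job_queue.get()
        if item is None:  # sentinel tells the worker to stop
            break
        job_id, payload = item
        metadata_store[job_id] = {"status": "done"}
        trace_store.append({"job_id": job_id, "trace": json.dumps(payload)})


t = threading.Thread(target=worker)
t.start()
submit_job("job-1", {"prompt": "konnichiwa"})
job_queue.put(None)  # shut the worker down after the queued job drains
t.join()
```

The point of the split is that the API thread never blocks on fine-tuning or evaluation; it only enqueues, and the worker fleet owns all slow I/O.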

Orchestration runs on Kubernetes + Helm. The Celery Parent / Worker deployments were retuned from the upstream blueprint to improve fault recovery and horizontal scalability. Long-running jobs cancel cleanly via `FlywheelCancelledError`, preserving partial artifacts. MLflow captures experiment metrics while Flower visualizes Celery worker state on a separate monitoring plane.

The improvement loop itself combines five evaluation and data techniques:
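The clean-cancellation behavior can be illustrated with a minimal sketch. The exception name matches the one used in the fork; the stage runner, checkpoint naming, and `cancel_after` knob are illustrative only.

```python
class FlywheelCancelledError(Exception):
    """Raised to abort a long-running flywheel job cooperatively."""


def run_stages(steps, cancel_after=None):
    """Run steps in order; on cancellation, keep already-written artifacts."""
    artifacts = []
    try:
        for i, step in enumerate(steps):
            # A real job would poll a cancellation flag between stages.
            if cancel_after is not None and i >= cancel_after:
                raise FlywheelCancelledError(f"cancelled before step {step!r}")
            artifacts.append(f"{step}.ckpt")  # partial artifact per completed step
    except FlywheelCancelledError:
        # Swallow the cancellation here so completed artifacts survive;
        # the caller sees a shorter artifact list, not a crash.
        pass
    return artifacts
```

Because the exception is caught at a stage boundary rather than killing the process, a cancelled fine-tuning run still leaves its finished checkpoints behind for inspection or resumption.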

  1. Base: replay production prompts verbatim to capture a baseline
  2. ICL (In-Context Learning): pick few-shot examples from traffic in either uniform or semantic-similarity mode
  3. Customized: evaluate base prompts against the LoRA fine-tuned model
  4. LLM-as-judge: paired comparison against a reference model emits a [0, 1] score
  5. Class-aware stratified splitting: automatically builds evaluation and training sets balanced across task classes
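Step 5 above can be sketched in a few lines. This is a hedged stand-in for the actual splitter: the `task_class` field name and the per-class minimum of one eval example are assumptions, but the core idea, each class contributing the same eval fraction so both sets stay balanced, is as described.

```python
import random
from collections import defaultdict


def stratified_split(records, eval_fraction=0.2, seed=0):
    """Class-aware split: every task class contributes the same eval fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec["task_class"]].append(rec)

    eval_set, train_set = [], []
    for recs in by_class.values():
        rng.shuffle(recs)
        # Keep at least one eval example per class so no class is unevaluated.
        k = max(1, int(len(recs) * eval_fraction))
        eval_set.extend(recs[:k])
        train_set.extend(recs[k:])
    return eval_set, train_set
```

With 10 records each in two classes and `eval_fraction=0.2`, each class lands exactly 2 records in the eval set, so rare task classes are never crowded out by dominant traffic.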

Differences from Upstream

On the Helm side, Elasticsearch / Kibana / MongoDB / Redis were moved out of the upstream chart and treated as externally managed services, so the existing production data platform can be reused while only the Blueprint layer is swapped. This also cleans up the handover path to the ops team.

`config-configmap.yaml` overrides models, prompts, and thresholds for Japanese SLMs; keeping it under a distinct name from the upstream `dfw-configmap.yaml` lets fork-side changes be tracked continuously with `git diff upstream/main`.

A new `NeMo_Data_Designer.md` documents Japanese-specific tokenization, keigo / plain-form handling, and domain vocabulary choices so the data-design decisions live in the repo, not in people's heads.

Infrastructure & Operations

LoRA fine-tuning with NeMo Customizer and automated evaluation with NeMo Evaluator run on H100 and A100 hardware. Production logs accumulated in Elasticsearch are classified and deduplicated by `workload_id` before being rolled into evaluation / training sets, keeping the improvement loop tight.
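The classify-and-deduplicate step can be sketched as follows. The record shape is an assumption, not the real Elasticsearch schema; the point is that repeated (workload, prompt) pairs from production traffic collapse to one record before dataset construction.

```python
def dedupe_logs(records):
    """Keep the first occurrence of each (workload_id, prompt) pair."""
    seen = set()
    deduped = []
    for rec in records:
        key = (rec["workload_id"], rec["prompt"])
        if key in seen:
            continue  # duplicate traffic for this workload; drop it
        seen.add(key)
        deduped.append(rec)
    return deduped
```

Deduplicating by `workload_id` before the split keeps high-volume prompts from dominating the evaluation and training sets, which is what keeps the improvement loop tight.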

Kibana dashboards visualize input data distribution and evaluation-score time series so data quality and model performance can be tracked quantitatively in one place. Brev on Google Cloud is used for development, letting GPU environments be spun up and torn down in minutes.

Technologies

Python · NVIDIA NeMo Microservices · NVIDIA NIM · NeMo Customizer (LoRA) · NeMo Evaluator · FastAPI · Celery · Redis · MongoDB · Elasticsearch · Kibana · MLflow · Flower · Kubernetes · Helm · H100/A100 · Google Cloud (Brev) · uv