Business Analysis Table Auto-Generation AI Pipeline
Global Consulting Firm
Overview
Designed and built an AI pipeline that streamlines client business analysis by parsing heterogeneous manuals (PDF / Word / Excel / PowerPoint) and producing a 3-layer 'Business Analysis Table' as CSV. A reproducible experiment harness lets prompt iterations run with a single command, and emitting a business-facing grouping axis alongside a technical-facing classification axis lets upstream review and downstream engineering share one artifact.
Architecture
`src/core/` is split into four stages — Converter, Extractor, Classifier, Grouper — each runnable and swappable in isolation. The split deliberately separates LLM-driven work (extraction, classification), where output drift is expected, from deterministic work (conversion, grouping). Every stage boundary is guarded by Pydantic v2 schemas (`WorkProcedureList`, `ClassificationTaskList`, `TaskGroupList`) so Gemini output drift is absorbed at a typed seam.
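The typed-seam idea can be sketched with Pydantic v2. The real fields of `WorkProcedureList` are not shown in this document, so the schema below is a hypothetical stand-in; the point is that malformed or drifted Gemini JSON raises a `ValidationError` at the stage boundary (where the real pipeline might retry the call) instead of flowing downstream:

```python
from pydantic import BaseModel, ValidationError

class WorkProcedure(BaseModel):
    """Hypothetical shape of one extracted procedure."""
    step_id: int
    title: str
    description: str = ""

class WorkProcedureList(BaseModel):
    procedures: list[WorkProcedure]

def parse_llm_output(raw_json: str) -> WorkProcedureList:
    """Absorb LLM output drift at a typed seam: invalid JSON or
    missing fields raise ValidationError here, not downstream."""
    return WorkProcedureList.model_validate_json(raw_json)
```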
Large PDFs are processed with a custom chunking strategy: PyMuPDF splits documents into 15-page chunks, each chunk is converted to Markdown in parallel, and the results are merged back in page order. Gemini calls are throttled with `asyncio.Semaphore(50)`, running up to 50 concurrent requests while staying inside rate limits.
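A minimal sketch of the chunk/fan-out/merge pattern, with the PyMuPDF-and-Gemini conversion stubbed out by a placeholder coroutine (the real call is not shown here):

```python
import asyncio

CHUNK_PAGES = 15          # chunk size used by the pipeline
MAX_CONCURRENCY = 50      # matches asyncio.Semaphore(50)

def split_pages(n_pages: int, chunk: int = CHUNK_PAGES) -> list[range]:
    """Split a document into page ranges of at most `chunk` pages."""
    return [range(s, min(s + chunk, n_pages)) for s in range(0, n_pages, chunk)]

async def convert_chunk(pages: range, sem: asyncio.Semaphore) -> tuple[int, str]:
    """Stand-in for the real PyMuPDF -> Gemini Markdown conversion."""
    async with sem:                    # throttle concurrent Gemini calls
        await asyncio.sleep(0)         # placeholder for the API round-trip
        return pages.start, f"<markdown for pages {pages.start}-{pages.stop - 1}>"

async def convert_document(n_pages: int) -> str:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [convert_chunk(r, sem) for r in split_pages(n_pages)]
    results = await asyncio.gather(*tasks)    # chunks may finish out of order
    results.sort(key=lambda t: t[0])          # merge back in page order
    return "\n\n".join(md for _, md in results)
```

Sorting by each chunk's starting page restores document order regardless of which API call returned first.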
The frontend deliberately avoids a heavy SPA: FastAPI + Jinja2 + htmx + Tailwind + Alpine.js + Mermaid.js stays lightweight, and pipeline progress is streamed over Server-Sent Events so long-running runs never freeze the UI. Central configuration lives in a single `src/config.py` so model, prompt-path, and output-dir swaps are a one-line change, enabling parallel experiments.
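The SSE side can be illustrated without the web framework. The generator below emits text/event-stream frames; in the real app a generator like this would back a FastAPI `StreamingResponse` with `media_type="text/event-stream"` (event and field names here are assumptions, not the pipeline's actual protocol):

```python
import json
from typing import Iterator

def sse_event(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame: event name + JSON payload,
    terminated by a blank line as the SSE spec requires."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def pipeline_progress(stages: list[str]) -> Iterator[str]:
    """Yield one frame per completed stage so the browser UI never freezes."""
    for i, stage in enumerate(stages, 1):
        yield sse_event("progress", {"stage": stage, "done": i, "total": len(stages)})
    yield sse_event("complete", {"ok": True})
```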
- Converter: PDF / Word / Excel / PowerPoint → Markdown conversion (LibreOffice soffice + PyMuPDF)
- Extractor: Pulls business procedures out of Markdown and produces structured tables
- Classifier: Detailed technical classification (first half of Two-Stage classification)
- Grouper: User-facing behavior grouping (second half of Two-Stage classification)
- Runner: `src/web/services/pipeline_runner.py` drives each stage asynchronously and streams progress over SSE
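The runner's stage-driving loop can be sketched as follows. This is a simplification of `pipeline_runner.py` under the assumption that each stage maps one artifact to the next and that a progress callback feeds the SSE stream:

```python
import asyncio
from typing import Awaitable, Callable

# Each stage transforms an artifact path/payload into the next one.
Stage = Callable[[str], Awaitable[str]]

async def run_pipeline(stages: dict[str, Stage], source: str,
                       on_progress: Callable[[str], None]) -> str:
    """Drive the stages in order, reporting each completion.
    In the real runner the callback emits an SSE progress event."""
    artifact = source
    for name, stage in stages.items():
        artifact = await stage(artifact)
        on_progress(name)
    return artifact
```

Because each stage only sees the previous artifact, any stage can be rerun or swapped in isolation, which is what makes the four-way split useful for experiments.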
Key Features
Two-Stage Classification: technical classification and business grouping are run as separate stages so humans can verify intermediate results before moving on — this avoids the accuracy drop and blurred responsibility that occur when a single prompt is asked to do both at once.
Experiment harness: `config.py` and the prompt template set are versioned together, so combinations of prompts and models can be executed in a single batch. Intermediate files and the final CSV land in an experiment-ID directory so runs can be laid out side-by-side for comparison.
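A hypothetical sketch of what the one-object configuration in `src/config.py` might look like; the field names and model identifiers below are assumptions for illustration, not the actual contents:

```python
from dataclasses import dataclass, replace
from pathlib import Path

@dataclass(frozen=True)
class Config:
    """Single source of truth: model, prompt path, and output dir in one place."""
    model: str = "model-a"
    prompt_dir: Path = Path("prompts/v1")
    output_dir: Path = Path("experiments")

def experiment_dir(cfg: Config, experiment_id: str) -> Path:
    """Intermediate files and the final CSV all land under one experiment ID."""
    return cfg.output_dir / experiment_id

# A one-line swap yields a parallel experiment configuration.
cfg_b = replace(Config(), model="model-b", prompt_dir=Path("prompts/v2"))
```

Freezing the dataclass keeps each experiment's configuration immutable, so side-by-side runs cannot silently share mutated state.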
Document fidelity: Office files are rendered to PDF through LibreOffice with Japanese fonts explicitly pinned to prevent mojibake; each CSV row carries the source file path and source page ID so every output row is traceable back to its origin.
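The traceability columns can be shown with a small CSV sketch; the column names are assumptions, since the document does not list the actual CSV header:

```python
import csv
import io

FIELDS = ["task", "classification", "group", "source_file", "source_page"]

def write_rows(rows: list[dict]) -> str:
    """Emit the analysis table; every row carries its source file and page
    so any output line can be traced back to its origin document."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```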
Cloud integration: Azure Blob Storage input/output scripts expose a single interface that works the same in local dev and cloud execution.
Development & Quality
Dependencies are pinned with Python 3.12 and uv. Google Python Style Guide docstrings are applied across the codebase, and Ruff enforces both linting and formatting. Docker Compose defines a hot-reload dev environment whose containers ship unchanged into production.