doc2dataset

Transform Documents into LLM Fine-tuning Datasets. A Python package designed to convert various document formats into high-quality training data.

The Challenge

doc2dataset moves beyond simple parsing by using LLMs to extract structured knowledge from documents.

Python-based pipeline with multiple extraction types (Q&A, rules, facts) and smart chunking.
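As a rough illustration of what chunking with sentence awareness can look like, the sketch below splits text into overlapping windows and prefers to cut at a sentence end. The function name `chunk_text` and its parameters are hypothetical and are not taken from the doc2dataset API; the package's own chunker may work differently.

```python
def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping chunks, preferring sentence boundaries.

    Hypothetical helper for illustration only; not the doc2dataset API.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to end the chunk at the last ". " inside the window,
            # as long as the chunk stays longer than the overlap.
            cut = text.rfind(". ", start, end)
            if cut > start + overlap:
                end = cut + 1  # keep the period, drop the space
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # Step back by `overlap` so neighbouring chunks share context.
        start = end - overlap
    return chunks
```

The overlap keeps a little shared context between neighbouring chunks, so a fact that straddles a boundary still appears whole in at least one of them.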

Source attribution to track data lineage back to specific files, pages, and chunks.
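One way to carry lineage with every generated example is to attach a small source record to it. The following is a minimal sketch under that assumption; the class names `SourceRef` and `TrainingExample` and the field names are hypothetical, not the package's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceRef:
    """Where a training example came from (hypothetical schema)."""
    file: str
    page: int
    chunk_index: int


@dataclass
class TrainingExample:
    """A Q&A pair plus its lineage (hypothetical schema)."""
    question: str
    answer: str
    source: SourceRef


def to_jsonl_line(example: TrainingExample) -> str:
    """Serialize one example, including its source, as a JSONL line."""
    # asdict() converts nested dataclasses into nested dicts.
    return json.dumps(asdict(example), ensure_ascii=False)
```

Because the source travels inside each record, any row in the final dataset can be traced back to its file, page, and chunk without a separate lookup table.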

Built-in quality filtering and scoring on a 0-1 scale. Parallel processing with async support.
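Parallel processing of chunks with async support can be sketched with `asyncio` and a semaphore that caps the number of requests in flight. Everything here is an assumption for illustration: `extract_stub` stands in for a real LLM call, and `process_all` and `run` are hypothetical names, not doc2dataset functions.

```python
import asyncio


async def extract_stub(chunk: str) -> dict:
    """Stand-in for an LLM call; a real pipeline would await an API client."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {"chunk": chunk, "n_chars": len(chunk)}


async def process_all(chunks: list[str], max_concurrency: int = 4) -> list[dict]:
    """Process chunks concurrently, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk: str) -> dict:
        async with semaphore:
            return await extract_stub(chunk)

    # gather() returns results in the same order as the input chunks.
    return await asyncio.gather(*(bounded(c) for c in chunks))


def run(chunks: list[str], max_concurrency: int = 4) -> list[dict]:
    """Synchronous entry point for scripts."""
    return asyncio.run(process_all(chunks, max_concurrency))
```

Capping concurrency keeps the pipeline under provider rate limits while still overlapping the waiting time of individual requests.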

Quality control pipeline with length, repetition, and duplicate filters.
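The three filters named above (length, repetition, duplicates) could be chained roughly as follows. This is a simplified sketch, not the package's implementation: the function names, the thresholds, and the word-level repetition measure are all assumptions chosen for illustration.

```python
import hashlib
import re


def repetition_ratio(text: str) -> float:
    """Fraction of words that repeat an earlier word (0.0 means all unique)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 1.0
    return 1.0 - len(set(words)) / len(words)


def filter_examples(
    examples: list[str],
    min_len: int = 20,
    max_len: int = 2000,
    max_repetition: float = 0.5,
) -> list[str]:
    """Apply length, repetition, and duplicate filters in that order."""
    seen = set()
    kept = []
    for text in examples:
        # Length filter: drop fragments and oversized outputs.
        if not (min_len <= len(text) <= max_len):
            continue
        # Repetition filter: drop text dominated by repeated words.
        if repetition_ratio(text) > max_repetition:
            continue
        # Duplicate filter: hash a case- and whitespace-normalized form.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Running the cheap length check first means the hash is only computed for examples that already passed the other filters.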

Built-in cost estimation and token counting before processing.
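A pre-run estimate can be approximated without calling any model. The sketch below assumes roughly four characters per token, which is only a rule of thumb for English text; a real tokenizer would give exact counts. The function names and the price argument are hypothetical placeholders, not doc2dataset settings or real provider prices.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count from character length (heuristic, not a tokenizer)."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(
    chunks: list[str],
    price_per_1k_input_tokens: float,
    prompt_overhead_tokens: int = 0,
) -> tuple[int, float]:
    """Return (total input tokens, estimated cost) for a batch of chunks.

    `prompt_overhead_tokens` accounts for the fixed instructions sent
    with every chunk. Output tokens are not included in this sketch.
    """
    total_tokens = sum(
        estimate_tokens(chunk) + prompt_overhead_tokens for chunk in chunks
    )
    cost = total_tokens / 1000 * price_per_1k_input_tokens
    return total_tokens, cost
```

For example, two 400-character chunks with 50 tokens of prompt overhead each come to 300 input tokens under this heuristic.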
