Transform Documents into LLM Fine-tuning Datasets. A Python package designed to convert various document formats into high-quality training data.
It moves beyond simple parsing by using LLMs to extract structured knowledge from the document text.
Python-based pipeline with multiple extraction types (Q&A, rules, facts) and smart chunking.
Source attribution to track data lineage back to specific files, pages, and chunks.
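The idea of chunking with source attribution can be illustrated with a minimal sketch. The package's real API is not shown in this description, so everything here (the `Chunk` dataclass, `chunk_pages`, the `max_chars` parameter) is a hypothetical stand-in, and the word-boundary splitting is a simple substitute for whatever "smart chunking" strategy the package actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical record: a piece of text plus the lineage to trace it back."""
    text: str
    source_file: str
    page: int         # 1-based page number
    chunk_index: int  # 0-based position of the chunk within its page


def chunk_pages(pages: list[str], source_file: str, max_chars: int = 200) -> list[Chunk]:
    """Split each page on whitespace into chunks of roughly max_chars.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks: list[Chunk] = []
    for page_number, page_text in enumerate(pages, start=1):
        current: list[str] = []
        current_len = 0
        index = 0
        for word in page_text.split():
            # Account for the joining space when the chunk already has words.
            added = len(word) + (1 if current else 0)
            if current and current_len + added > max_chars:
                # Adding this word would exceed the limit: close the chunk.
                chunks.append(Chunk(" ".join(current), source_file, page_number, index))
                index += 1
                current, current_len = [], 0
                added = len(word)
            current.append(word)
            current_len += added
        if current:
            chunks.append(Chunk(" ".join(current), source_file, page_number, index))
    return chunks
```

Because every chunk carries its file, page, and index, any training example derived from it can be traced back to the exact location it came from.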
Built-in quality filtering, with each example scored from 0 to 1. Parallel processing with async support.
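Parallel processing with async support might look roughly like the sketch below, which bounds concurrency with a semaphore. The function names and the `max_concurrency` parameter are assumptions for illustration, and `extract_stub` only simulates latency in place of a real LLM call.

```python
import asyncio


async def extract_stub(chunk: str) -> dict:
    """Hypothetical stand-in for an LLM extraction call; it only simulates latency."""
    await asyncio.sleep(0.01)
    return {"chunk": chunk, "n_words": len(chunk.split())}


async def process_all(chunks: list[str], max_concurrency: int = 4) -> list[dict]:
    """Process chunks concurrently, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk: str) -> dict:
        async with semaphore:
            return await extract_stub(chunk)

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(c) for c in chunks))
```

A caller would run it with `asyncio.run(process_all(chunks))`. Capping concurrency is a common way to stay under a provider's rate limits while still overlapping network waits.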
Quality control pipeline with length, repetition, and duplicate filters.
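As a rough sketch of how length, repetition, and duplicate filters could combine into a 0-1 score, consider the following. The heuristics, thresholds, and function names are assumptions chosen for illustration, not the package's actual scoring method.

```python
def repetition_ratio(text: str) -> float:
    """Fraction of words that repeat an earlier word (0.0 means all unique)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


def quality_score(text: str, min_words: int = 5, max_words: int = 200) -> float:
    """Heuristic score in [0, 1]; 0.0 when the word count is out of range."""
    n_words = len(text.split())
    if n_words < min_words or n_words > max_words:
        return 0.0
    return 1.0 - repetition_ratio(text)


def filter_examples(examples: list[str], threshold: float = 0.5) -> list[str]:
    """Drop duplicates (case- and whitespace-insensitive), then low scorers."""
    seen: set[str] = set()
    kept: list[str] = []
    for example in examples:
        key = " ".join(example.lower().split())
        if key in seen:
            continue  # duplicate filter
        seen.add(key)
        if quality_score(example) >= threshold:
            kept.append(example)
    return kept
```

Running the duplicate check before scoring avoids spending work on examples that would be discarded anyway.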
Built-in cost estimation and token counting before processing.
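A pre-run cost estimate can be approximated as in the sketch below. It assumes about four characters per token, a rough rule of thumb for English text rather than a real tokenizer, and all function names, parameters, and prices are hypothetical placeholders supplied by the caller.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token count; chars_per_token is an assumed average, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(
    chunks: list[str],
    prompt_overhead_tokens: int,
    expected_output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> dict:
    """Estimate token usage and cost for one LLM call per chunk."""
    # Each call sends the chunk plus a fixed prompt overhead.
    input_tokens = sum(estimate_tokens(c) + prompt_overhead_tokens for c in chunks)
    output_tokens = expected_output_tokens * len(chunks)
    cost = (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )
    return {"input_tokens": input_tokens, "output_tokens": output_tokens, "cost": cost}
```

For accurate counts, the character heuristic would be replaced with the tokenizer of the target model; the structure of the estimate stays the same.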