A comprehensive evaluation framework designed to benchmark the performance of document parsing systems, such as Unstructured, that leverage Visual Language Models (VLMs). SCORE (Structural and Content Robust Evaluation) is a new, interpretation-agnostic framework that addresses the limitations of applying traditional, deterministic metrics (like CER, WER, or TEDS) to generative document parsing. This repository illustrates how SCORE can be applied in practice to measure the quality of the outputs of generative document parsers.
Resources: the SCORE paper and the blog post "Benchmarking Document Parsing (and What Actually Matters)".
Document parsing is the process of extracting structured information from unstructured documents. The emergence of multi-modal generative systems (like GPT-5 Mini or Gemini 2.5 Flash) has made traditional evaluation inadequate, as these systems often produce semantically correct yet structurally divergent outputs. Conventional metrics misclassify this diversity as an error, penalizing valid interpretations. The SCORE framework is designed to embrace this representational diversity while enforcing semantic rigor by advancing evaluation in four critical dimensions:
- Adjusted Edit Distance: Robust content fidelity that tolerates structural reorganization.
- Token-Level Diagnostics: Separation of content omissions from hallucinations.
- Table Evaluation: Semantic alignment with spatial tolerance for structural variations.
- Hierarchy-Aware Consistency: Assessment of document structure understanding.
This repository provides:
- A collection of document examples with labeled ground truth
- A suite of evaluation metrics for assessing parsing quality
- Scripts for calculating and aggregating these metrics
- A framework for benchmarking document parsing systems
The SCORE framework goes beyond simple text extraction metrics to evaluate structural understanding, element classification, table extraction, content organization, and hallucinations.
```
unstructured-eval-metrics/
├── configs/                      # Configuration files for evaluation
│   └── all_metrics_config.json   # Example configuration for all metrics
├── simple-data/                  # Example documents and ground truth
│   ├── cct-gt/                   # Ground truth for text content evaluation
│   ├── od-first-prediction/      # Example prediction outputs
│   ├── src/                      # Example PDFs
│   └── table-gt/                 # Ground truth for table evaluation
└── src/                          # Source code
    ├── common/                   # Common utilities
    ├── evaluation/               # Evaluation orchestration
    ├── persist/                  # Data persistence utilities
    ├── processing/               # Data transformation functions
    ├── scoring/                  # Performance metrics implementation
    └── utils/                    # General utility functions
```
The Unstructured internal evaluation dataset, designed to measure performance on real-world document complexity, includes diverse formats such as scanned invoices, technical manuals, and financial reports—cases that differentiate production-grade systems from research prototypes. However, a significant portion of the internal evaluation dataset consists of proprietary enterprise documents and cannot be publicly distributed.
To ensure transparency and reproducibility, we provide a few representative labeled examples in the simple-data/ directory:
- The original PDF files can be found under `simple-data/src`.
- The sample Unstructured outputs are located under `simple-data/od-first-prediction`.
- Ground truth labels are under `simple-data/cct-gt` and `simple-data/table-gt`.
You can use these examples to:
- Understand the evaluation methodology and data formats used to calculate SCORE metrics.
- Label your own documents following the same conventions.
- Run your own benchmarks and compare results using your data.
This approach allows you to replicate the SCORE evaluation process and assess performance on datasets that reflect your own domain.
The repository uses the following types of ground truth data:
- Text Content Ground Truth (in the `simple-data/cct-gt` directory):
  - Files use the format: `{original_filename}__uns-plaintext-v1.0.0__{id}.txt`
  - Content is structured with markers for different document elements, enabling evaluation against a clean concatenated text representation (CCT). For example (a parsing sketch follows this list):

    ```
    --------------------------------------------------- Unstructured Plain Text Format 1
    --------------------------------------------------- Unstructured Title Begin
    DOCUMENT TITLE
    --------------------------------------------------- Unstructured Title End
    --------------------------------------------------- Unstructured NarrativeText Begin
    Document content...
    --------------------------------------------------- Unstructured NarrativeText End
    ```
- Table Ground Truth (in the `simple-data/table-gt` directory):
  - Files use the format: `{original_filename}__uns-table-json-v1.0.0__{id}.json`
  - Tables are represented as JSON with cell coordinates and content, serving as the ground truth for our format-agnostic table evaluation. For example:

    ```json
    [
      {
        "type": "Table",
        "text": [
          {
            "id": "cell-id",
            "x": 0,
            "y": 0,
            "w": 1,
            "h": 1,
            "content": "Cell content"
          },
          ...
        ]
      }
    ]
    ```
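As a minimal sketch of how the plain-text ground truth can be consumed, the snippet below splits a CCT file into (element type, text) pairs using the Begin/End markers shown above. The regex and the helper name `parse_cct_ground_truth` are illustrative assumptions, not part of the repository's API.

```python
import re
from pathlib import Path

# Marker lines in the ground truth look like:
#   --------------------------------------------------- Unstructured Title Begin
MARKER_RE = re.compile(r"^-+\s+Unstructured\s+(\w+)\s+(Begin|End)\s*$")

def parse_cct_ground_truth(path: str) -> list[tuple[str, str]]:
    """Split a CCT ground-truth file into (element_type, text) pairs."""
    elements: list[tuple[str, str]] = []
    current_type: str | None = None
    buffer: list[str] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        match = MARKER_RE.match(line)
        if match:
            element_type, boundary = match.groups()
            if boundary == "Begin":
                current_type, buffer = element_type, []
            elif current_type == element_type:
                elements.append((element_type, "\n".join(buffer).strip()))
                current_type, buffer = None, []
        elif current_type is not None:
            buffer.append(line)  # content line between Begin and End markers
    return elements

# Hypothetical file name following the convention above:
# for etype, text in parse_cct_ground_truth(
#         "simple-data/cct-gt/example.pdf__uns-plaintext-v1.0.0__0.txt"):
#     print(etype, text[:60])
```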
To label your own documents:
- Create a directory for your document set
- For text content:
  - Create plaintext files with the appropriate markers for document elements
  - Name files according to the convention: `{original_filename}__uns-plaintext-v1.0.0__{id}.txt`
- For tables (see the sketch below):
  - Create JSON files with table structure and cell content
  - Name files according to the convention: `{original_filename}__uns-table-json-v1.0.0__{id}.json`
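The sketch below writes one labeled table in the JSON layout shown earlier, with a file name built from the naming convention. The helper names, the treatment of the original file name (whether the extension is kept), and the example cell values are illustrative assumptions; check them against the sample files in `simple-data/table-gt`.

```python
import json
from pathlib import Path

def table_gt_filename(original_filename: str, label_id: str) -> str:
    """Build a table ground-truth file name following the naming convention."""
    return f"{original_filename}__uns-table-json-v1.0.0__{label_id}.json"

def write_table_ground_truth(out_dir: str, original_filename: str, label_id: str,
                             cells: list[dict]) -> Path:
    """Write one labeled table as [{"type": "Table", "text": [...cells...]}]."""
    payload = [{"type": "Table", "text": cells}]
    out_path = Path(out_dir) / table_gt_filename(original_filename, label_id)
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return out_path

# Example cells follow the layout of the sample ground-truth files
# (id, x, y, w, h, content); the values here are purely illustrative.
cells = [
    {"id": "r0c0", "x": 0, "y": 0, "w": 1, "h": 1, "content": "Item"},
    {"id": "r0c1", "x": 1, "y": 0, "w": 1, "h": 1, "content": "Amount"},
    {"id": "r1c0", "x": 0, "y": 1, "w": 1, "h": 1, "content": "Total"},
    {"id": "r1c1", "x": 1, "y": 1, "w": 1, "h": 1, "content": "42.00"},
]
# write_table_ground_truth("my-table-gt", "invoice.pdf", "0", cells)
```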
The SCORE framework includes a multi-dimensional suite of metrics to characterize performance beyond a single score.
These metrics focus on the accuracy of the extracted text content, adjusted to be robust against structural differences.
| Metric | Definition & Formula | Context |
|---|---|---|
| CCT (Clean Concatenated Text) | Based on Normalized Edit Distance (NED). | Serves as the traditional baseline, measuring raw string-level similarity. It is known to fail when content is semantically equivalent but structurally reorganized. |
| Adjusted CCT (Adj. NED) | Corrects the CCT/NED score by applying word-weighted fuzzy alignment across different elements (tables, paragraphs), recognizing semantic equivalence despite structural variation. | Recovers equivalence between alternative but valid interpretations. |
| Percent Tokens Found (TokensFound) | Measures the recall of reference tokens preserved, independent of ordering. | A direct indicator of content omission/loss. |
| Percent Tokens Added (TokensAdded) | Measures the proportion of spurious tokens generated. | Serves as a direct indicator of the model's hallucination level. |
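The sketch below illustrates the ideas behind these metrics with straightforward implementations: a character-level normalized edit distance, plus bag-of-tokens recall (tokens found) and a spurious-token rate (tokens added). It is not the repository's implementation, which additionally applies word-weighted fuzzy alignment for the adjusted score.

```python
from collections import Counter

def normalized_edit_distance(ref: str, hyp: str) -> float:
    """Levenshtein distance divided by the length of the longer string (0 = identical)."""
    if not ref and not hyp:
        return 0.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), len(hyp))

def token_stats(ref: str, hyp: str) -> tuple[float, float]:
    """Return (percent_tokens_found, percent_tokens_added) over whitespace tokens."""
    ref_counts, hyp_counts = Counter(ref.split()), Counter(hyp.split())
    overlap = sum((ref_counts & hyp_counts).values())
    found = overlap / max(sum(ref_counts.values()), 1)      # recall: omissions lower this
    added = 1 - overlap / max(sum(hyp_counts.values()), 1)  # spurious tokens: hallucination signal
    return found, added

ref = "Total amount due: 42.00 USD"
hyp = "Total amount due 42.00 USD extra"
print(1 - normalized_edit_distance(ref, hyp))  # CCT-style similarity (1 - NED)
print(token_stats(ref, hyp))
```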
These metrics assess the system's ability to capture the organizational structure of the document.
| Metric | Context |
|---|---|
| Element Alignment (F1-based) | Measures how well predicted elements align with ground truth. It is largely a reality-grounded metric that captures how systems map heterogeneous labels into a coherent hierarchy. |
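As a rough illustration of an F1-style alignment score, the sketch below matches predicted (element type, text) pairs against ground truth greedily by fuzzy text similarity and reports F1. The similarity threshold and the greedy strategy are illustrative choices, not the repository's algorithm.

```python
from difflib import SequenceMatcher

def element_alignment_f1(gt: list[tuple[str, str]],
                         pred: list[tuple[str, str]],
                         threshold: float = 0.8) -> float:
    """Greedy one-to-one matching of (element_type, text) pairs by text similarity."""
    unmatched_pred = list(pred)
    matches = 0
    for gt_type, gt_text in gt:
        best_idx, best_score = None, 0.0
        for idx, (p_type, p_text) in enumerate(unmatched_pred):
            if p_type != gt_type:
                continue  # only elements of the same type can align
            score = SequenceMatcher(None, gt_text, p_text).ratio()
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score >= threshold:
            matches += 1
            unmatched_pred.pop(best_idx)
    precision = matches / len(pred) if pred else 0.0
    recall = matches / len(gt) if gt else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

gt = [("Title", "Quarterly Report"), ("NarrativeText", "Revenue grew 10% year over year.")]
pred = [("Title", "Quarterly Report"), ("NarrativeText", "Revenue grew 10% year over year")]
print(element_alignment_f1(gt, pred))
```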
The framework employs a format-agnostic representation where all structured outputs (HTML, JSON, etc.) are mapped into equivalent cell tuples (cell coordinates plus content, as in the table ground truth format above) before comparison.
| Metric | Context |
|---|---|
| Detection F1 | Recasts table detection as a content-based classification problem, not a bounding-box matching task. This is critical for VLMs, which lack explicit object detection modules. |
| Content Accuracy | Measures the accuracy of the extracted cell text/content (text recognition). |
| Index Accuracy | Measures the accuracy of the spatial relationships (cell index/position). Evaluation incorporates spatial tolerance for positional shifts, ensuring systems aren't penalized for legitimate variations like merged vs. split headers. |
| TEDS | Tree Edit Distance-based Similarity. This traditional metric is integrated to evaluate hierarchical table structures. |
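The sketch below shows the flavor of cell-level table scoring over the format-agnostic cell tuples: recovered cell text drives content accuracy, while index accuracy allows a small positional tolerance so off-by-one row/column shifts are not treated as failures. The tolerance value and matching logic are illustrative, not the repository's exact scoring.

```python
def table_cell_scores(gt_cells: list[dict], pred_cells: list[dict],
                      tolerance: int = 1) -> tuple[float, float]:
    """Return (content_accuracy, index_accuracy) over ground-truth cells.

    Each cell is a dict with 'x', 'y', and 'content', as in the sample
    table ground-truth files.
    """
    if not gt_cells:
        return 0.0, 0.0
    pred_by_content: dict[str, list[tuple[int, int]]] = {}
    for cell in pred_cells:
        pred_by_content.setdefault(cell["content"].strip(), []).append((cell["x"], cell["y"]))

    content_hits = index_hits = 0
    for cell in gt_cells:
        positions = pred_by_content.get(cell["content"].strip(), [])
        if positions:
            content_hits += 1  # the cell text was recovered somewhere in the predicted table
            # index accuracy: position within +/- tolerance of the ground-truth cell
            if any(abs(px - cell["x"]) <= tolerance and abs(py - cell["y"]) <= tolerance
                   for px, py in positions):
                index_hits += 1
    return content_hits / len(gt_cells), index_hits / len(gt_cells)

gt = [{"x": 0, "y": 0, "content": "Item"}, {"x": 1, "y": 0, "content": "Amount"}]
pred = [{"x": 0, "y": 1, "content": "Item"}, {"x": 1, "y": 1, "content": "Amount"}]
print(table_cell_scores(gt, pred))  # content matches; indices shifted one row, within tolerance
```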
To benchmark a document parsing system:
- Process your documents with the parsing system to generate outputs
- Ensure outputs are in the expected format (JSON for structured parsing)
- Update the configuration file specifying:
- Metrics to calculate
- Directories for predictions and ground truth
- Format specifications
- Run the evaluation scripts to calculate metrics
- Analyze the results to identify strengths and weaknesses
- Python 3.11 or higher
- uv: https://docs.astral.sh/uv/getting-started/installation/
- Install dependencies using `uv`:

  ```bash
  make install
  ```

  This will create a virtual environment and install all required dependencies specified in `pyproject.toml`.

- `make check`: Performs a linting check.
- `make tidy`: Automatically fixes linting issues.

The scoring package implements performance metrics:
- `aggregate_scoring.py`: Generates aggregated scores across multiple datapoints
- `cct_adjustment.py`: Adjusts CCT (Clean Concatenated Text) metrics for fair comparison
- `content_scoring.py`: Measures OCR quality and reading order accuracy
- `element_consistency_scoring.py`: Evaluates consistency of element detection and classification
- `structure_scoring.py`: Assesses document structure and element alignment
- `table_scoring.py`: Evaluates table structure and content extraction accuracy
The processing package contains data transformation functions:
- `aggregate.py`: Functions for constructing datapoints and aggregating data across documents
- `structure_processing.py`: Document structure processing and element extraction
- `table_processing.py`: Extracting, parsing, and normalizing table structures
- `text_processing.py`: Text extraction, normalization, and tokenization
The evaluation package provides the evaluation orchestration:
- `config_model.py`: Pydantic-based configuration models for validation
- `generate_agg_eval_metrics.py`: Generates aggregate metrics across datasets
- `generate_doc_eval_metrics.py`: Produces document-level evaluation metrics
To generate metrics for individual documents:
```bash
PYTHONPATH=./src python3 -m evaluation.generate_doc_eval_metrics \
  --config configs/all_metrics_config.json \
  --output_csv reports/all_metrics_report.csv
```

To generate aggregate metrics across all documents:
```bash
PYTHONPATH=./src python3 -m evaluation.generate_agg_eval_metrics \
  --config configs/all_metrics_config.json \
  --output_csv reports/all_agg_metrics_report.csv
```

Evaluations are configured using JSON files:

```json
{
"storage": {
"store_metrics": false // Storage is used to store metrics in S3. It is set to false by default.
},
"formats": {
"sample_formats": {
"cct": "v1",
"adjusted_cct": "v1",
"percent_tokens_found": "v1",
"percent_tokens_added": "v1",
"tables": "html",
"element_alignment": "v1",
"element_consistency": "v1"
},
"ground_truth_formats": {
"cct": "v1",
"adjusted_cct": "v1",
"percent_tokens_found": "v1",
"percent_tokens_added": "v1",
"tables": "text",
"element_alignment": "v1",
"element_consistency": "v1"
}
},
"datasets": [
{
"name": "simple-data",
"metrics": [
"cct",
"adjusted_cct",
"percent_tokens_found",
"percent_tokens_added",
"element_alignment",
"element_consistency"
],
"to_evaluate_dir": "./simple-data/od-first-prediction",
"ground_truth_dir": "./simple-data/cct-gt"
},
{
"name": "simple-data",
"metrics": [
"tables"
],
"to_evaluate_dir": "./simple-data/od-first-prediction",
"ground_truth_dir": "./simple-data/table-gt"
}
]
}
```

After running the evaluation scripts, results will be saved to the specified CSV file. The results include:
- Document-level metrics for each document and metric type
- Aggregate metrics across all documents in the dataset
These metrics can be used to:
- Compare different document parsing systems
- Track improvements in parsing quality over time
- Identify specific areas for improvement, distinguishing between hallucination, omission, and structural variation.
This project is licensed under the Apache License, Version 2.0.