A comprehensive evaluation framework designed to benchmark the performance of document parsing systems, such as Unstructured, that leverage Visual Language Models (VLMs). SCORE (Structural and Content Robust Evaluation) is a new, interpretation-agnostic framework that addresses the limitations of applying traditional, deterministic metrics (like CER, WER, or TEDS) to generative document parsing. This repository illustrates how SCORE can be applied in practice to measure the quality of the outputs of generative document parsers.
Resources: the SCORE paper and the blog post "Benchmarking Document Parsing (and What Actually Matters)".
Document parsing is the process of extracting structured information from unstructured documents. The emergence of multi-modal generative systems (like GPT-5 Mini or Gemini 2.5 Flash) has made traditional evaluation inadequate, as these systems often produce semantically correct yet structurally divergent outputs. Conventional metrics misclassify this diversity as an error, penalizing valid interpretations. The SCORE framework is designed to embrace this representational diversity while enforcing semantic rigor by advancing evaluation in four critical dimensions:
- Adjusted Edit Distance: Robust content fidelity that tolerates structural reorganization.
- Token-Level Diagnostics: Separation of content omissions from hallucinations.
- Table Evaluation: Semantic alignment with spatial tolerance for structural variations.
- Hierarchy-Aware Consistency: Assessment of document structure understanding.
This repository provides:
- A collection of document examples with labeled ground truth
- A suite of evaluation metrics for assessing parsing quality
- Scripts for calculating and aggregating these metrics
- A framework for benchmarking document parsing systems
The SCORE framework goes beyond simple text extraction metrics to evaluate structural understanding, element classification, table extraction, content organization, and hallucinations.
```
unstructured-eval-metrics/
├── configs/                      # Configuration files for evaluation
│   └── all_metrics_config.json   # Example configuration for all metrics
├── simple-data/                  # Example documents and ground truth
│   ├── cct-gt/                   # Ground truth for text content evaluation
│   ├── od-first-prediction/      # Example prediction outputs
│   ├── src/                      # Example PDFs
│   └── table-gt/                 # Ground truth for table evaluation
└── src/                          # Source code
    ├── common/                   # Common utilities
    ├── evaluation/               # Evaluation orchestration
    ├── persist/                  # Data persistence utilities
    ├── processing/               # Data transformation functions
    ├── scoring/                  # Performance metrics implementation
    └── utils/                    # General utility functions
```
The Unstructured internal evaluation dataset, designed to measure performance on real-world document complexity, includes diverse formats such as scanned invoices, technical manuals, and financial reports—cases that differentiate production-grade systems from research prototypes. However, a significant portion of the internal evaluation dataset consists of proprietary enterprise documents and cannot be publicly distributed.
To ensure transparency and reproducibility, we provide a few representative labeled examples in the simple-data/ directory:
- The original PDF files can be found under `simple-data/src`.
- The sample Unstructured outputs are located under `simple-data/od-first-prediction`.
- Ground truth labels are under `simple-data/cct-gt` and `simple-data/table-gt`.
You can use these examples to:
- Understand the evaluation methodology and data formats used to calculate SCORE metrics.
- Label your own documents following the same conventions.
- Run your own benchmarks and compare results using your data.
This approach allows you to replicate the SCORE evaluation process and assess performance on datasets that reflect your own domain.
The repository uses the following types of ground truth data:
- Text Content Ground Truth (in the `simple-data/cct-gt` directory):
  - Files use the format: `{original_filename}__uns-plaintext-v1.0.0__{id}.txt`
  - Content is structured with markers for different document elements, enabling evaluation against a clean concatenated text representation (CCT). For example (a parsing sketch follows this list):

    ```
    --------------------------------------------------- Unstructured Plain Text Format 1
    --------------------------------------------------- Unstructured Title Begin
    DOCUMENT TITLE
    --------------------------------------------------- Unstructured Title End
    --------------------------------------------------- Unstructured NarrativeText Begin
    Document content...
    --------------------------------------------------- Unstructured NarrativeText End
    ```
- Table Ground Truth (in the `simple-data/table-gt` directory):
  - Files use the format: `{original_filename}__uns-table-json-v1.0.0__{id}.json`
  - Tables are represented as JSON with cell coordinates and content, serving as the ground truth for our format-agnostic table evaluation. For example:

    ```json
    [
      {
        "type": "Table",
        "text": [
          {
            "id": "cell-id",
            "x": 0,
            "y": 0,
            "w": 1,
            "h": 1,
            "content": "Cell content"
          },
          ...
        ]
      }
    ]
    ```
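As a minimal sketch of how the plain-text ground truth can be consumed, the snippet below splits a CCT file into (element type, text) pairs using the Begin/End markers shown above. The regex and the helper name `parse_cct_ground_truth` are illustrative assumptions, not part of the repository's API.

```python
import re
from pathlib import Path

# Marker lines in the ground truth look like:
#   --------------------------------------------------- Unstructured Title Begin
MARKER_RE = re.compile(r"^-+\s+Unstructured\s+(\w+)\s+(Begin|End)\s*$")

def parse_cct_ground_truth(path: str) -> list[tuple[str, str]]:
    """Split a CCT ground-truth file into (element_type, text) pairs."""
    elements: list[tuple[str, str]] = []
    current_type: str | None = None
    buffer: list[str] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        match = MARKER_RE.match(line)
        if match:
            element_type, boundary = match.groups()
            if boundary == "Begin":
                current_type, buffer = element_type, []
            elif current_type == element_type:
                elements.append((element_type, "\n".join(buffer).strip()))
                current_type, buffer = None, []
        elif current_type is not None:
            buffer.append(line)  # content line between Begin and End markers
    return elements

# Hypothetical file name following the convention above:
# for etype, text in parse_cct_ground_truth(
#         "simple-data/cct-gt/example.pdf__uns-plaintext-v1.0.0__0.txt"):
#     print(etype, text[:60])
```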
To label your own documents:
- Create a directory for your document set
- For text content:
  - Create plaintext files with the appropriate markers for document elements
  - Name files according to the convention: `{original_filename}__uns-plaintext-v1.0.0__{id}.txt`
- For tables (see the sketch below):
  - Create JSON files with table structure and cell content
  - Name files according to the convention: `{original_filename}__uns-table-json-v1.0.0__{id}.json`
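The sketch below writes one labeled table in the JSON layout shown earlier, with a file name built from the naming convention. The helper names, the treatment of the original file name (whether the extension is kept), and the example cell values are illustrative assumptions; check them against the sample files in `simple-data/table-gt`.

```python
import json
from pathlib import Path

def table_gt_filename(original_filename: str, label_id: str) -> str:
    """Build a table ground-truth file name following the naming convention."""
    return f"{original_filename}__uns-table-json-v1.0.0__{label_id}.json"

def write_table_ground_truth(out_dir: str, original_filename: str, label_id: str,
                             cells: list[dict]) -> Path:
    """Write one labeled table as [{"type": "Table", "text": [...cells...]}]."""
    payload = [{"type": "Table", "text": cells}]
    out_path = Path(out_dir) / table_gt_filename(original_filename, label_id)
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return out_path

# Example cells follow the layout of the sample ground-truth files
# (id, x, y, w, h, content); the values here are purely illustrative.
cells = [
    {"id": "r0c0", "x": 0, "y": 0, "w": 1, "h": 1, "content": "Item"},
    {"id": "r0c1", "x": 1, "y": 0, "w": 1, "h": 1, "content": "Amount"},
    {"id": "r1c0", "x": 0, "y": 1, "w": 1, "h": 1, "content": "Total"},
    {"id": "r1c1", "x": 1, "y": 1, "w": 1, "h": 1, "content": "42.00"},
]
# write_table_ground_truth("my-table-gt", "invoice.pdf", "0", cells)
```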
The SCORE framework includes a multi-dimensional suite of metrics to characterize performance beyond a single score.
These metrics focus on the accuracy of the extracted text content, adjusted to be robust against structural differences.
| Metric | Definition & Formula | Context |
|---|---|---|
| CCT (Clean Concatenated Text) | Based on Normalized Edit Distance (NED). | Serves as the traditional baseline, measuring raw string-level similarity. It is known to fail when content is semantically equivalent but structurally reorganized. |
| Adjusted CCT (Adj. NED) | Corrects the CCT/NED score by applying word-weighted fuzzy alignment across different elements (tables, paragraphs), recognizing semantic equivalence despite structural variation. | Recovers equivalence between alternative but valid interpretations. |
| Percent Tokens Found (TokensFound) | Measures the recall of reference tokens preserved, independent of ordering. | A direct indicator of content omission/loss. |
| Percent Tokens Added (TokensAdded) | Measures the proportion of spurious tokens generated. | Serves as a direct indicator of the model's hallucination level. |
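The sketch below illustrates the ideas behind these metrics with straightforward implementations: a character-level normalized edit distance, plus bag-of-tokens recall (tokens found) and a spurious-token rate (tokens added). It is not the repository's implementation, which additionally applies word-weighted fuzzy alignment for the adjusted score.

```python
from collections import Counter

def normalized_edit_distance(ref: str, hyp: str) -> float:
    """Levenshtein distance divided by the length of the longer string (0 = identical)."""
    if not ref and not hyp:
        return 0.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), len(hyp))

def token_stats(ref: str, hyp: str) -> tuple[float, float]:
    """Return (percent_tokens_found, percent_tokens_added) over whitespace tokens."""
    ref_counts, hyp_counts = Counter(ref.split()), Counter(hyp.split())
    overlap = sum((ref_counts & hyp_counts).values())
    found = overlap / max(sum(ref_counts.values()), 1)      # recall: omissions lower this
    added = 1 - overlap / max(sum(hyp_counts.values()), 1)  # spurious tokens: hallucination signal
    return found, added

ref = "Total amount due: 42.00 USD"
hyp = "Total amount due 42.00 USD extra"
print(1 - normalized_edit_distance(ref, hyp))  # CCT-style similarity (1 - NED)
print(token_stats(ref, hyp))
```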
These metrics assess the system's ability to capture the organizational structure of the document.
| Metric | Context |
|---|---|
| Element Alignment (F1-based) | Measures how well predicted elements align with ground truth. It is largely a reality-grounded metric that captures how systems map heterogeneous labels into a coherent hierarchy. |
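As a rough illustration of an F1-style alignment score, the sketch below matches predicted (element type, text) pairs against ground truth greedily by fuzzy text similarity and reports F1. The similarity threshold and the greedy strategy are illustrative choices, not the repository's algorithm.

```python
from difflib import SequenceMatcher

def element_alignment_f1(gt: list[tuple[str, str]],
                         pred: list[tuple[str, str]],
                         threshold: float = 0.8) -> float:
    """Greedy one-to-one matching of (element_type, text) pairs by text similarity."""
    unmatched_pred = list(pred)
    matches = 0
    for gt_type, gt_text in gt:
        best_idx, best_score = None, 0.0
        for idx, (p_type, p_text) in enumerate(unmatched_pred):
            if p_type != gt_type:
                continue  # only elements of the same type can align
            score = SequenceMatcher(None, gt_text, p_text).ratio()
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score >= threshold:
            matches += 1
            unmatched_pred.pop(best_idx)
    precision = matches / len(pred) if pred else 0.0
    recall = matches / len(gt) if gt else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

gt = [("Title", "Quarterly Report"), ("NarrativeText", "Revenue grew 10% year over year.")]
pred = [("Title", "Quarterly Report"), ("NarrativeText", "Revenue grew 10% year over year")]
print(element_alignment_f1(gt, pred))
```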
The framework employs a format-agnostic representation where all structured outputs (HTML, JSON, etc.) are mapped into equivalent cell tuples (cell coordinates plus content, as in the table ground truth format above) before comparison.
| Metric | Context |
|---|---|
| Detection F1 | Recasts table detection as a content-based classification problem, not a bounding-box matching task. This is critical for VLMs, which lack explicit object detection modules. |
| Content Accuracy | Measures the accuracy of the extracted cell text/content (text recognition). |
| Index Accuracy | Measures the accuracy of the spatial relationships (cell index/position). Evaluation incorporates spatial tolerance for positional shifts, ensuring systems aren't penalized for legitimate variations like merged vs. split headers. |
| TEDS | Tree Edit Distance-based Similarity. This traditional metric is integrated to evaluate hierarchical table structures. |
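The sketch below shows the flavor of cell-level table scoring over the format-agnostic cell tuples: recovered cell text drives content accuracy, while index accuracy allows a small positional tolerance so off-by-one row/column shifts are not treated as failures. The tolerance value and matching logic are illustrative, not the repository's exact scoring.

```python
def table_cell_scores(gt_cells: list[dict], pred_cells: list[dict],
                      tolerance: int = 1) -> tuple[float, float]:
    """Return (content_accuracy, index_accuracy) over ground-truth cells.

    Each cell is a dict with 'x', 'y', and 'content', as in the sample
    table ground-truth files.
    """
    if not gt_cells:
        return 0.0, 0.0
    pred_by_content: dict[str, list[tuple[int, int]]] = {}
    for cell in pred_cells:
        pred_by_content.setdefault(cell["content"].strip(), []).append((cell["x"], cell["y"]))

    content_hits = index_hits = 0
    for cell in gt_cells:
        positions = pred_by_content.get(cell["content"].strip(), [])
        if positions:
            content_hits += 1  # the cell text was recovered somewhere in the predicted table
            # index accuracy: position within +/- tolerance of the ground-truth cell
            if any(abs(px - cell["x"]) <= tolerance and abs(py - cell["y"]) <= tolerance
                   for px, py in positions):
                index_hits += 1
    return content_hits / len(gt_cells), index_hits / len(gt_cells)

gt = [{"x": 0, "y": 0, "content": "Item"}, {"x": 1, "y": 0, "content": "Amount"}]
pred = [{"x": 0, "y": 1, "content": "Item"}, {"x": 1, "y": 1, "content": "Amount"}]
print(table_cell_scores(gt, pred))  # content matches; indices shifted one row, within tolerance
```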
To benchmark a document parsing system:
- Process your documents with the parsing system to generate outputs
- Ensure outputs are in the expected format (JSON for structured parsing)
- Update the configuration file specifying:
- Metrics to calculate
- Directories for predictions and ground truth
- Format specifications
- Run the evaluation scripts to calculate metrics
- Analyze the results to identify strengths and weaknesses
- Python 3.11 or higher
- uv: https://docs.astral.sh/uv/getting-started/installation/
- Install dependencies using `uv`:

  ```bash
  make install
  ```

  This will create a virtual environment and install all required dependencies specified in `pyproject.toml`.

- `make check`: Performs a linting check.
- `make tidy`: Automatically fixes linting issues.

The scoring package implements performance metrics:
- `aggregate_scoring.py`: Generates aggregated scores across multiple datapoints
- `cct_adjustment.py`: Adjusts CCT (Clean Concatenated Text) metrics for fair comparison
- `content_scoring.py`: Measures OCR quality and reading order accuracy
- `element_consistency_scoring.py`: Evaluates consistency of element detection and classification
- `structure_scoring.py`: Assesses document structure and element alignment
- `table_scoring.py`: Evaluates table structure and content extraction accuracy
The processing package contains data transformation functions:
- `aggregate.py`: Functions for constructing datapoints and aggregating data across documents
- `structure_processing.py`: Document structure processing and element extraction
- `table_processing.py`: Extracting, parsing, and normalizing table structures
- `text_processing.py`: Text extraction, normalization, and tokenization
The evaluation package provides the evaluation orchestration:
- `config_model.py`: Pydantic-based configuration models for validation
- `generate_agg_eval_metrics.py`: Generates aggregate metrics across datasets
- `generate_doc_eval_metrics.py`: Produces document-level evaluation metrics
To generate metrics for individual documents:
```bash
PYTHONPATH=./src python3 -m evaluation.generate_doc_eval_metrics \
  --config configs/all_metrics_config.json \
  --output_csv reports/all_metrics_report.csv
```

To generate aggregate metrics across all documents:
```bash
PYTHONPATH=./src python3 -m evaluation.generate_agg_eval_metrics \
  --config configs/all_metrics_config.json \
  --output_csv reports/all_agg_metrics_report.csv
```

Evaluations are configured using JSON files:

```json
{
"storage": {
"store_metrics": false // Storage is used to store metrics in S3. It is set to false by default.
},
"formats": {
"sample_formats": {
"cct": "v1",
"adjusted_cct": "v1",
"percent_tokens_found": "v1",
"percent_tokens_added": "v1",
"tables": "html",
"element_alignment": "v1",
"element_consistency": "v1"
},
"ground_truth_formats": {
"cct": "v1",
"adjusted_cct": "v1",
"percent_tokens_found": "v1",
"percent_tokens_added": "v1",
"tables": "text",
"element_alignment": "v1",
"element_consistency": "v1"
}
},
"datasets": [
{
"name": "simple-data",
"metrics": [
"cct",
"adjusted_cct",
"percent_tokens_found",
"percent_tokens_added",
"element_alignment",
"element_consistency"
],
"to_evaluate_dir": "./simple-data/od-first-prediction",
"ground_truth_dir": "./simple-data/cct-gt"
},
{
"name": "simple-data",
"metrics": [
"tables"
],
"to_evaluate_dir": "./simple-data/od-first-prediction",
"ground_truth_dir": "./simple-data/table-gt"
}
]
}
```

After running the evaluation scripts, results will be saved to the specified CSV file. The results include:
- Document-level metrics for each document and metric type
- Aggregate metrics across all documents in the dataset
These metrics can be used to:
- Compare different document parsing systems
- Track improvements in parsing quality over time
- Identify specific areas for improvement, distinguishing between hallucination, omission, and structural variation.
This project is licensed under the Apache License, Version 2.0.