
Add preprocessing documentation for DeepSeek-r1 and Llama3.1-8b #2270


Open
anivar wants to merge 1 commit into master from fix/preprocessing-documentation

Conversation

anivar

@anivar anivar commented Jul 20, 2025

Summary

Addresses issue #2245 by adding comprehensive preprocessing documentation for models that currently lack a documented reproduction methodology.

Changes

  • PREPROCESSING-TEMPLATE.md - Standardized template for future models
  • language/llama3.1-8b/PREPROCESSING.md - Complete preprocessing documentation
  • language/deepseek-r1/PREPROCESSING.md - Comprehensive preprocessing documentation

Problem Solved

Issue #2245 identified that while llama2-70b/processorca.py provides complete preprocessing transparency, newer models hide preprocessing behind preprocessed datasets, breaking reproducibility.
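For context, what makes the llama2-70b script reproducible is that the whole pipeline lives in one runnable file: raw data in, explicit filter criteria, fixed-seed sampling, serialized output. A minimal sketch of that pattern (the dataset, tokenizer id, thresholds, and sample count below are illustrative placeholders, not values taken from processorca.py):

```python
# Illustrative sketch of the "single runnable script" pattern: raw data in,
# explicit filters, fixed-seed sampling, serialized output. Dataset, tokenizer,
# thresholds, and sample count are placeholders, NOT values taken from
# llama2-70b/processorca.py.
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer


def preprocess(output_path: str, seed: int = 42) -> None:
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
    raw = load_dataset("Open-Orca/OpenOrca", split="train")

    rows = []
    for example in raw:
        ids = tokenizer.encode(example["question"])
        # Documented filter criterion instead of an opaque, pre-filtered file.
        if 10 <= len(ids) <= 1024:
            rows.append({"input": example["question"], "tok_input_len": len(ids)})

    # Fixed-seed sampling so anyone can regenerate exactly the same subset.
    pd.DataFrame(rows).sample(n=1000, random_state=seed).to_pickle(output_path)


if __name__ == "__main__":
    preprocess("open_orca_processed.pkl")
```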

Solution Approach

  • Maintenance-focused: Documents existing patterns rather than introducing new architecture
  • Template-based: Establishes standard for consistent future documentation
  • Gap analysis: Clearly identifies what preprocessing steps are missing
  • Adaptation guidance: Helps users adapt preprocessing for different tokenizers/models

Documentation Added

Llama3.1-8b

  • Documents current CNN/DailyMail preprocessing via download_cnndm.py and prepare-calibration.py
  • Identifies missing filtering, sampling, and quality control steps (see the sketch after this list)
  • Provides adaptation guide for different tokenizers
  • References llama2-70b/processorca.py patterns for completeness
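As a rough illustration of the documented flow and of the kind of explicit length filtering the documentation flags as missing, a hedged sketch (the tokenizer id, dataset split, prompt template, and 2048-token cap are assumptions for illustration; download_cnndm.py remains the authoritative reference):

```python
# Hedged sketch of CNN/DailyMail preprocessing for a Llama 3.1 8B style model.
# Tokenizer id, split, prompt template, and length cap are assumptions, not
# values taken from download_cnndm.py.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
cnndm = load_dataset("cnn_dailymail", "3.0.0", split="validation")

samples = []
for example in cnndm:
    prompt = f"Summarize the following article:\n\n{example['article']}\n\nSummary:"
    input_ids = tokenizer.encode(prompt)
    # The explicit input-length filter the documentation identifies as missing.
    if len(input_ids) <= 2048:
        samples.append(
            {
                "input": prompt,
                "output": example["highlights"],
                "tok_input_len": len(input_ids),
            }
        )
```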

DeepSeek-r1

  • Documents current cloud-based preprocessed dataset approach
  • Identifies critical gaps in raw data access and preprocessing scripts
  • Explains challenges for tokenizer/model adaptation (see the sketch after this list)
  • Recommends improvements for full reproducibility
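To make the adaptation gap concrete, a hedged sketch of the situation the documentation describes: when only a preprocessed token-id artifact is published, moving to a different tokenizer requires a lossy round trip through decoded text (the file name and column name below are placeholders, not the actual deepseek-r1 artifact):

```python
# Hedged sketch of why a preprocessed-only dataset blocks clean adaptation.
# "preprocessed.parquet" and the "tok_input" column are placeholders, not the
# actual artifact published for deepseek-r1.
import pandas as pd
from transformers import AutoTokenizer

source_tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")
target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

df = pd.read_parquet("preprocessed.parquet")  # token ids only, no raw text

adapted = []
for ids in df["tok_input"]:
    # Best effort: decode with the original tokenizer, re-encode with the new
    # one. Without the raw text and the original prompt template this round
    # trip is lossy, which is the reproducibility gap the documentation flags.
    text = source_tok.decode(ids, skip_special_tokens=True)
    adapted.append(target_tok.encode(text))
```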

Benefits

  1. Immediate transparency - Users understand current preprocessing limitations
  2. Future standardization - Template guides consistent documentation
  3. Adaptation support - Clear guidance for different use cases
  4. Community contribution - Maintenance approach that builds on existing work

Next Steps

This documentation provides a foundation for:

  • Creating complete preprocessing scripts for affected models
  • Establishing preprocessing standards across all models
  • Improving benchmark reproducibility and transparency

Closes #2245

@anivar anivar requested a review from a team as a code owner July 20, 2025 10:23
Contributor

github-actions bot commented Jul 20, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

Contributor

@hanyunfan hanyunfan left a comment


LGTM, more info added to the README files

@arjunsuresh
Contributor

@hanyunfan This is a template, not actual information. We should pass this to the respective task forces and get the details.

@mrmhodak
Contributor

WG Meeting: Will look at this later.

- Created PREPROCESSING.md template for standardized documentation
- Added comprehensive preprocessing documentation for Llama3.1-8b
- Added comprehensive preprocessing documentation for DeepSeek-r1
- Documented current preprocessing gaps and missing reproducibility steps
- Established standard template for future model documentation
- Based documentation on successful llama2-70b/processorca.py patterns

Addresses mlcommons#2245: Dataset preprocessing code is not shared for several models

This maintenance contribution improves preprocessing transparency by:
1. Documenting existing preprocessing patterns
2. Identifying gaps in current documentation
3. Providing template for consistent future documentation
4. Enabling better adaptation across different tokenizers/models
@anivar anivar force-pushed the fix/preprocessing-documentation branch from 79cc505 to 4e425a0 Compare July 24, 2025 15:48