
Add preprocessing documentation for DeepSeek-r1 and Llama3.1-8b #2270


Open
anivar wants to merge 1 commit into master from fix/preprocessing-documentation

Conversation

anivar

@anivar anivar commented Jul 20, 2025

Summary

Addresses issue #2245 by adding comprehensive preprocessing documentation for models that currently lack a documented reproduction methodology.

Changes

  • PREPROCESSING-TEMPLATE.md - Standardized template for future models
  • language/llama3.1-8b/PREPROCESSING.md - Complete preprocessing documentation
  • language/deepseek-r1/PREPROCESSING.md - Comprehensive preprocessing documentation

Problem Solved

Issue #2245 identified that while llama2-70b/processorca.py provides complete preprocessing transparency, newer models hide preprocessing behind preprocessed datasets, breaking reproducibility.
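For context, what makes the llama2-70b script reproducible is that the whole pipeline lives in one runnable file: raw data in, explicit filter criteria, fixed-seed sampling, serialized output. A minimal sketch of that pattern (the dataset, tokenizer id, thresholds, and sample count below are illustrative placeholders, not values taken from processorca.py):

```python
# Illustrative sketch of the "single runnable script" pattern: raw data in,
# explicit filters, fixed-seed sampling, serialized output. Dataset, tokenizer,
# thresholds, and sample count are placeholders, NOT values taken from
# llama2-70b/processorca.py.
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer


def preprocess(output_path: str, seed: int = 42) -> None:
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
    raw = load_dataset("Open-Orca/OpenOrca", split="train")

    rows = []
    for example in raw:
        ids = tokenizer.encode(example["question"])
        # Documented filter criterion instead of an opaque, pre-filtered file.
        if 10 <= len(ids) <= 1024:
            rows.append({"input": example["question"], "tok_input_len": len(ids)})

    # Fixed-seed sampling so anyone can regenerate exactly the same subset.
    pd.DataFrame(rows).sample(n=1000, random_state=seed).to_pickle(output_path)


if __name__ == "__main__":
    preprocess("open_orca_processed.pkl")
```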

Solution Approach

  • Maintenance-focused: Documents existing patterns rather than introducing new architecture
  • Template-based: Establishes standard for consistent future documentation
  • Gap analysis: Clearly identifies what preprocessing steps are missing
  • Adaptation guidance: Helps users adapt preprocessing for different tokenizers/models

Documentation Added

Llama3.1-8b

  • Documents current CNN/DailyMail preprocessing via download_cnndm.py and prepare-calibration.py
  • Identifies missing filtering, sampling, and quality control steps (see the sketch after this list)
  • Provides adaptation guide for different tokenizers
  • References llama2-70b/processorca.py patterns for completeness
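As a rough illustration of the documented flow and of the kind of explicit length filtering the documentation flags as missing, a hedged sketch (the tokenizer id, dataset split, prompt template, and 2048-token cap are assumptions for illustration; download_cnndm.py remains the authoritative reference):

```python
# Hedged sketch of CNN/DailyMail preprocessing for a Llama 3.1 8B style model.
# Tokenizer id, split, prompt template, and length cap are assumptions, not
# values taken from download_cnndm.py.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
cnndm = load_dataset("cnn_dailymail", "3.0.0", split="validation")

samples = []
for example in cnndm:
    prompt = f"Summarize the following article:\n\n{example['article']}\n\nSummary:"
    input_ids = tokenizer.encode(prompt)
    # The explicit input-length filter the documentation identifies as missing.
    if len(input_ids) <= 2048:
        samples.append(
            {
                "input": prompt,
                "output": example["highlights"],
                "tok_input_len": len(input_ids),
            }
        )
```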

DeepSeek-r1

  • Documents current cloud-based preprocessed dataset approach
  • Identifies critical gaps in raw data access and preprocessing scripts
  • Explains challenges for tokenizer/model adaptation (see the sketch after this list)
  • Recommends improvements for full reproducibility
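To make the adaptation gap concrete, a hedged sketch of the situation the documentation describes: when only a preprocessed token-id artifact is published, moving to a different tokenizer requires a lossy round trip through decoded text (the file name and column name below are placeholders, not the actual deepseek-r1 artifact):

```python
# Hedged sketch of why a preprocessed-only dataset blocks clean adaptation.
# "preprocessed.parquet" and the "tok_input" column are placeholders, not the
# actual artifact published for deepseek-r1.
import pandas as pd
from transformers import AutoTokenizer

source_tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1")
target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

df = pd.read_parquet("preprocessed.parquet")  # token ids only, no raw text

adapted = []
for ids in df["tok_input"]:
    # Best effort: decode with the original tokenizer, re-encode with the new
    # one. Without the raw text and the original prompt template this round
    # trip is lossy, which is the reproducibility gap the documentation flags.
    text = source_tok.decode(ids, skip_special_tokens=True)
    adapted.append(target_tok.encode(text))
```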

Benefits

  1. Immediate transparency - Users understand current preprocessing limitations
  2. Future standardization - Template guides consistent documentation
  3. Adaptation support - Clear guidance for different use cases
  4. Community contribution - Maintenance approach that builds on existing work

Next Steps

This documentation provides a foundation for:

  • Creating complete preprocessing scripts for affected models
  • Establishing preprocessing standards across all models
  • Improving benchmark reproducibility and transparency

Closes #2245

@anivar anivar requested a review from a team as a code owner July 20, 2025 10:23
Contributor

github-actions bot commented Jul 20, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

Contributor

@hanyunfan hanyunfan left a comment


LGTM, more info added to the README files

@arjunsuresh
Contributor

@hanyunfan This is a template, not actual information. We should pass this to the respective task forces and get the details.

@mrmhodak
Contributor

WG Meeting: Will look at this later.

- Created PREPROCESSING.md template for standardized documentation
- Added comprehensive preprocessing documentation for Llama3.1-8b
- Added comprehensive preprocessing documentation for DeepSeek-r1
- Documented current preprocessing gaps and missing reproducibility steps
- Established standard template for future model documentation
- Based documentation on successful llama2-70b/processorca.py patterns

Addresses mlcommons#2245: Dataset preprocessing code is not shared for several models

This maintenance contribution improves preprocessing transparency by:
1. Documenting existing preprocessing patterns
2. Identifying gaps in current documentation
3. Providing template for consistent future documentation
4. Enabling better adaptation across different tokenizers/models
@anivar anivar force-pushed the fix/preprocessing-documentation branch from 79cc505 to 4e425a0 Compare July 24, 2025 15:48