16 Jul 13:05

HYLcool

7505686

Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.

Major Updates

🔧 Introduce the Data-Juicer MCP server, so users can conveniently access Data-Juicer's data processing capabilities through MCP. #690 #737
💪🏻 Unit test coverage is raised to 85%+ and several bugs in test cases (OOM, encoding errors, and so on) are resolved, making Data-Juicer more reliable. #698 #717 #720 #727
🤝 GPU-based MinHash deduplication is supported, in collaboration with developers from NVIDIA. #694 #644
🧩 RayExporter supports exporting a Ray dataset in more formats in addition to json/jsonl. #687
🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738
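
As background for the MinHash deduplication item above, here is a minimal CPU-only sketch of the general MinHash idea. It is not the Data-Juicer or GPU implementation; the function names, the word-level shingling, and the choice of BLAKE2b as the hash are all assumptions made for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=64):
    """Hypothetical helper: one minimum hash value per seeded hash function."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per "permutation"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "data juicer is a data processing system for large language models"
doc_b = "data juicer is a data processing system for multimodal models"
similarity = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates; real deduplicators usually add locality-sensitive hashing so that not every pair has to be compared.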

New Operators

download_file_mapper downloads data from URLs to local files or specified fields. #709

Enhancements

New analysis method: correlation analysis among stats is added. #663
Several core dependencies are updated and pinned to newer versions, and dependency conflicts are resolved. #715 #717 #723
The EasyAnimate pipelines in the sandbox are updated to follow the sandbox refactoring. #710
Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
Support storing and processing image bytes data in the dataset. #725
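
To illustrate what a correlation analysis among stats can look like, here is a small sketch that computes pairwise Pearson correlations between per-sample statistics. The release notes do not state which correlation method Data-Juicer uses, so Pearson is an assumption, and the stat names (`text_len`, `num_words`, `special_char_ratio`) and helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    # A constant column has zero variance, so the correlation is undefined.
    return cov / denom if denom else float("nan")


def correlation_matrix(stats):
    """Map every ordered pair of stat names to its correlation."""
    names = sorted(stats)
    return {(a, b): pearson(stats[a], stats[b]) for a in names for b in names}


# Hypothetical per-sample stats, one value per sample.
stats = {
    "text_len": [10, 20, 30, 40],
    "num_words": [2, 4, 6, 8],
    "special_char_ratio": [0.4, 0.3, 0.2, 0.1],
}
matrix = correlation_matrix(stats)
```

Strongly correlated stats (values close to 1 or -1) carry overlapping information, which can help decide which filters are redundant in a recipe.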

Bugs Fixed

The wheel & docker image building bug is fixed. #706
Fix bugs in log_summarization. #710
Fix "no module named data_juicer" error after installing from the wheel file. #727

Acknowledgement

@fanronghai helps to fix the param error in the dataset_splitting_by_language tool. #713
@ayushdg helps to support a GPU-version Minhash deduplicator. #644
@ricksun2023 helps to fix the bugs that occur when the configs contain multiple OPs with the same name. #730

Full Changelog: v1.4.0...v1.4.1

Contributors

ayushdg, fanronghai, and ricksun2023

13 Jun 11:43

yxdyc

v1.4.0

714df97

v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)

Summary: 200+ files changed, with 18,535 additions and 3,720 deletions.

🔧 Major Refactors & Improvements

🔄 Sandbox Usability (#686):
- Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
- Includes the InternVL example as a showcase.
📘 DJ-Doc Redesign (#675):
- Now with multilingual support (English / Chinese) and a modernized style.
📦 Dependency Management Update (#660, #680):
- Migrated to uv for faster dependency resolution.
- Added sub-groups for better organization.

🌍 New Features & Integrations (#683, #688, #692)

🆕 Additional Repo Supported:
- Trinity-RFT now supported by Data-Juicer.
📜 DJ-Awesome-List:
- A survey paper accepted by TPAMI'25!
🧪 Synthetic Benchmark Added:
- DetailMaster – a new benchmark for synthetic data evaluation.
🛠️ New Operators Introduced (#673, #701):
- llm_analysis_filter
- general_field_filter

🚀 Core Optimizations & Bug Fixes

✅ Ray Executor Enhancements (#697):
- File extension detection added.
- Support for more data formats.
⏱️ Startup Time Optimization:
- Improved startup performance. (#684)
🧠 Text Embedding Support:
- Added support for text embedding via API and local model. (#681)
🐳 Docker Build Improvement:
- Ignore installed distutils libraries during Docker image building. (#668)
🛠️ Mapper Module Fix:
- Fixed issue with module initialization. (#700)
🗑️ Warning Suppression:
- Suppressed unnecessary warnings from fasttext. (#696)

📚 Full Changelog

View all changes since v1.3.3 →

09 May 10:20

HYLcool

v1.3.3

444537e

Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.

Major Updates

🎉 Our work of Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
Add new OPs and recipes for Img-Diff. #658

Enhancements

Support Hugging Face LLMs for the two llm_xxx_score_filter OPs. #655
Sync the docker image to Aliyun OSS for downloading when Docker Hub is not accessible. #657
Split standalone and distributed unit tests to save time when re-running failed ones. #666

Bugs Fixed

Address possibly missing cfg in unify_format. #653
Improve clarity & fix bad links for some docs. #659

Acknowledgement

@co63oc helps to fix some typos. #654

Full Changelog: v1.3.2...v1.3.3

Contributors

co63oc

25 Apr 11:17

yxdyc

v1.3.2

2172698

Release v1.3.2: Enhancements on usability & two OPs; some bug fixes

What's Changed

Human OP enhancements, in #642 #645
- Update the label-studio version
- Make the service script more robust
- Add documentation
- Optimize fields mapping
Efficiency optimization of the document_minhash_deduplicator OP, in #639
Set temp_parser.usage to argparse.SUPPRESS to skip excessive help logs, in #643
Fix date typo in #648
Fix docker building failure in #650
Fix StreamToLoguru compatibility issue with torch._dynamo in #651
Add init file for the annotation module to fix the dj-process command error, in #652
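
The `argparse.SUPPRESS` change above relies on standard-library behavior: when a parser's `usage` is set to `argparse.SUPPRESS`, the usage line is omitted from its help and error output. The sketch below shows that mechanism in isolation; the parser setup and the `--config` and `--extra` arguments are assumptions for illustration, not Data-Juicer's actual code.

```python
import argparse

# A temporary parser that only pre-parses a few options (hypothetical setup).
temp_parser = argparse.ArgumentParser(add_help=False)
temp_parser.usage = argparse.SUPPRESS  # omit the usage line from any output
temp_parser.add_argument("--config", type=str)

# parse_known_args leaves unrecognized options for a later, full parser.
args, unknown = temp_parser.parse_known_args(
    ["--config", "demo.yaml", "--extra", "1"]
)

# With usage suppressed, the formatted usage contains no "usage:" line.
usage_text = temp_parser.format_usage()
```

This keeps a helper parser quiet, so users only see the help text of the main parser.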

New Contributor

@cmgzn made their first contribution in #651

Contributors

cmgzn

11 Apr 09:48

HYLcool

v1.3.1

e90a759

Release v1.3.1: added HumanOPs & fixed some bugs

Major Updates

💥 Prototype implementation for HumanOps (annotation). #617 Included features:
- boilerplate code for supporting Label Studio-powered human annotation OPs
- a reference implementation of human preference annotation
- Label Studio service script, which can start a local instance using docker or pip, whichever is available
- reference configs and data
- event-driven and notification mixin framework for OPs

New OPs

extract_tables_from_html_mapper: extract tables from html texts. #634
general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
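
The idea behind an explicitly fused operator can be sketched as function composition over a batch: each OP receives the batch produced by the previous one, so the whole chain runs in a single pass. This is a conceptual illustration only; `fuse_ops`, `strip_text`, and `drop_short` are hypothetical names and do not reflect the real general_fused_op interface.

```python
def fuse_ops(*ops):
    """Return one callable that applies the given OPs in order to a batch."""
    def fused(batch):
        for op in ops:
            batch = op(batch)
        return batch
    return fused


def strip_text(batch):
    """Mapper-like OP: trim surrounding whitespace from each sample's text."""
    return [{**sample, "text": sample["text"].strip()} for sample in batch]


def drop_short(batch, min_len=5):
    """Filter-like OP: keep samples whose text has at least min_len chars."""
    return [sample for sample in batch if len(sample["text"]) >= min_len]


fused_op = fuse_ops(strip_text, drop_short)
result = fused_op([{"text": "  hello world  "}, {"text": " hi "}])
```

Because the order of the OPs is given explicitly, the caller controls exactly which operations see the batch and in what sequence.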

Bugs Fixed

fix dataset builder initialization failure #630
update Executor references from Executor to DefaultExecutor #632 #633
switch the backend of plt to avoid sub-process/thread error #633
fix some boundary condition bugs in several deduplicators #635 #637

Others

Check the dataset when loading to support passing a dataset to the DefaultExecutor.run method. #633
Update docs to highlight the light environment installation part. #636

Acknowledgement

@liuyuhanalex helps to add a new OP and fix some of the boundary condition bugs. #634 #635

Full Changelog: v1.3.0...v1.3.1

Contributors

liuyuhanalex

28 Mar 12:08

yxdyc

v1.3.0

1b9afd1

Release v1.3.0: Refactor of dataset builder and executor!

The Big Change 🚀

Refactor of dataset builder and executor, see #537, @cyruszhang
📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
🔌 Remove the Executor's hard-coded format support: it is no longer restricted to local JSON formats, and the input format is determined dynamically via formatters/downloaders.
🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
🔍 Add data format validation to ensure consistency and correctness.
🌐 Expanded data source support:
a. 📦 ModelScope integration.
b. 📚 ArXiv dataset import (download, decompress, ingest).
c. 📚 Wikipedia dataset support (download, decompress, ingest).
d. 🌐 Common Crawl integration (download, decompress, ingest).
🔗 Backward compatibility with existing dataset_path command-line syntax.
🔀 Support for data mixtures to combine multiple datasets dynamically.
🔧 Support for empty formatters/generated datasets without pre-defined config files.

Others 💡

🔊 New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
📊 Added dynamic coverage rate badge to the README for transparency (PR #625)

Contributors

cyruszhang and liuyuhanalex

14 Mar 09:58

BeachWang

v1.2.2

8d09410

Release v1.2.2

Major Updates

🧪 Add documentation for the API service. Add parameter passing via json.dumps to support API calls for arbitrary registered functions and classes. #613
🚀 Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.

New OPs

llm_quality_score_filter: Filter to keep samples with high quality scores estimated by an LLM, supporting both API calls and local vLLM calls. #606 #614 #620
llm_difficulty_score_filter: Filter to keep samples with high difficulty scores estimated by an LLM, supporting both API calls and local vLLM calls. #606 #614 #620

Others

Fix config in LLaVa pretrain recipe. #610
Update news for MindGYM and fix doc. #615
Fix decode error through UTF-8 decoding. #618

28 Feb 07:50

HYLcool

v1.2.1

6014bcc

Release v1.2.1

Major Updates

DJ has been integrated in Ray's official Ecosystem and Example Gallery. Besides, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by Apache Arrow.
Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
Unit test optimization:
- split unit tests into partial and regression tests: the partial test is triggered by PRs and only runs the test cases corresponding to changed files; the regression test runs all cases and is triggered at 7:00 every Friday, Beijing time. #598
- use the native @unittest.skip decorator and remove SKIPPED_TESTS. #586
- upload test coverage reports to GitHub artifacts. #586
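
For readers unfamiliar with it, `@unittest.skip` is the standard-library decorator that marks a test as skipped and records the reason in the test result. The sketch below shows the decorator on a hypothetical test case; the class, test names, and skip reason are made up for illustration and are not taken from the Data-Juicer test suite.

```python
import unittest


class DemoTest(unittest.TestCase):
    @unittest.skip("needs a GPU; skipped in this sketch")
    def test_gpu_only(self):
        self.fail("never executed because the test is skipped")

    def test_always_runs(self):
        self.assertEqual(1 + 1, 2)


def run_demo():
    """Run DemoTest programmatically and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DemoTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Skips declared this way live next to the test they affect, and the reason shows up in the test report, which is harder to achieve with a separate list of skipped test names.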

New OPs

image_remove_background_mapper: remove the background of images. #589

Others

add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
only build doc for py3.10. #586
move dependency on ray to minimal requirements. #586 #594 #595
allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
fix undefined fileno bug of the logger. #594

Acknowledgement

@liuyuhanalex helps simplify the code logic of OP fusion, add a new OP image_remove_background_mapper, and fix some minor bugs. #581 #585 #589
@co63oc helps to fix typos in code and documents. #582 #583 #588 #591 #593
@danielhjz helps to fix the implicit memory leak problem in image_nsfw_filter. #590

Contributors

co63oc, danielhjz, and liuyuhanalex

14 Feb 09:40

yxdyc

v1.2.0

7820a4d

v1.2.0 Doc refactored; New algorithm proposed

What's New

📚 The DJ doc is refactored and improved, e.g., RecipeGallery, DeveloperGuide, DistributedProcess, DJ-related Competitions, and fixes for typos and bad links.
🔎 More unit-tests added.
🎛 The data pre-split and export are improved.
🔮 A new data selection method, DaaR, is proposed. See Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data.

Detailed PRs

fix export error when export_stats columns is null in #557
Resplit input dataset in ray mode in #549
Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in #561
Resolve most skipped unit-tests in #559
fix translation error in #562
Add unittest for ray text dedup in #540
[Typo]correct a small typo in #563
update the 2.0 paper link & the DaaR news in #566
Fix typos in #571
Optimization for sdxl_prompt2prompt_mapper dependency importing in #570
Fix typos in #572

Acknowledgment

@liuyuhanalex @co63oc made their first PRs

Full Changelog: v1.1.0...v1.2.0

Contributors

co63oc and liuyuhanalex

17 Jan 09:46

BeachWang

v1.1.0

030e786

Release v1.1.0

Major Updates

🧪 Users can now run ray-based distributed data processing under the guidance of the added docs. #523
🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
💥 Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. #526
🚀 Append a log summary of warnings and errors at the end of a run to make them recognizable under the sample fault tolerance mechanism. #534
🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
🛝 Add usability tags for OPs:
- alpha tag for OPs in which only the basic OP implementations are finished;
- beta tag for OPs in which unittests are added based on the alpha version;
- stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.

New OPs

image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
mllm_mapper: Mapper to use MLLMs to generate texts for images. #550
sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
sentence_augmentation_mapper: Augment sentences using LLMs. #550
text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550

Bugs Fixed

Add a global skip_op_error param to enable fault tolerance when executing the Data-Juicer analyzer and executor, while disabling fault tolerance for unit tests. #528
Fix model force download bug. #529
Fix IndexError if the number of samples in the result dataset is less than the number of workers when saving dataset to disk. #536
Fix missing field meta tag on ray mode. #538
Update max_tokens or max_new_tokens for vllm-based OPs to avoid too short generation. #544
Fix bug in the role playing data generation demo. #545

Others

Enhance unit test for API calling OPs. #528
Remove sandbox requirements installation from Dockerfile. #530
Update the datasource related APIs to be compatible with the latest version of Ray. #532
Limit the generated qa num for each text in generate_qa_from_text_mapper. #541
Update docs for preparing DJ2.0 release. #542
Update a quick cdn link for arch figure. #543
Add a video demo for role playing data generation. #545
Optimize op doc for global textual search. #552
Use a more stable and faster translator than google translator for automatic OP doc building. #554

Acknowledgement

@Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550

Contributors

Qirui-jiao

Releases: modelscope/data-juicer
