Releases: modelscope/data-juicer

Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.

16 Jul 13:05
7505686

Major Updates

  • 🔧 Introduce the Data-Juicer MCP server. Users can now conveniently access Data-Juicer's data processing capabilities through MCP. #690 #737
  • 💪🏻 Unit test coverage rate is improved to 85%+ and several bugs in test cases are resolved (OOM, encoding error, and so on), which makes Data-Juicer more reliable. #698 #717 #720 #727
  • 🤝 GPU-based Minhash deduplication is supported, in collaboration with developers from NVIDIA. #694 #644
  • 🧩 RayExporter supports more formats to export a ray dataset in addition to json/jsonl. #687
  • 🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738
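
For readers unfamiliar with the technique behind the Minhash deduplicator, the following is a minimal CPU-only sketch of MinHash signatures in plain Python. It is not Data-Juicer's implementation (GPU or otherwise); the function names `shingles`, `minhash_signature`, and `estimated_jaccard` are hypothetical and exist only for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "data juicer is a data processing system for foundation models"
doc_b = "data juicer is a data processing system for large models"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
similarity = estimated_jaccard(sig_a, sig_b)
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates; the threshold and the shingle size are tuning choices, not values taken from this release.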

New Operators

  • download_file_mapper downloads data from URLs to local files or specified fields. #709

Enhancements

  • New analysis method: correlation analysis among stats is added. #663
  • Several core dependencies are updated and pinned to newer versions, and dependency conflicts are resolved. #715 #717 #723
  • The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of sandbox. #710
  • Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
  • Support storing and processing image bytes data in the dataset. #725
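
As an illustration of what correlation analysis among stats can look like, here is a small sketch that computes pairwise Pearson correlations between per-sample stats. The stat names and values are made up, and the helpers `pearson` and `correlation_matrix` are hypothetical; the actual analysis method added in #663 may use a different correlation measure or library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(stats):
    """Map every pair of stat names to their Pearson correlation."""
    return {a: {b: pearson(stats[a], stats[b]) for b in stats} for a in stats}


# Hypothetical per-sample stats, one value per sample.
stats = {
    "text_len": [10, 20, 30, 40],
    "num_words": [2, 4, 6, 8],
    "special_char_ratio": [0.4, 0.3, 0.2, 0.1],
}
corr = correlation_matrix(stats)
```

Strongly correlated stats carry overlapping information, which can help when deciding which filters are redundant.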

Bugs Fixed

  • The wheel & docker image building bug is fixed. #706
  • Fix bugs in log_summarization. #710
  • Fix "no module named data_juicer" error after installing from the wheel file. #727

Acknowledgement

  • @fanronghai helps to fix the param error in dataset_splitting_by_language tool. #713
  • @ayushdg helps to support a GPU-version Minhash deduplicator. #644
  • @ricksun2023 helps to fix the bugs that occur when the configs contain more than one OP with the same name. #730

Full Changelog: v1.4.0...v1.4.1

v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)

13 Jun 11:43
714df97

Summary: 200+ files changed with 18,535 additions and 3,720 deletions.


🔧 Major Refactors & Improvements

  • 🔄 Sandbox Usability (#686):

    • Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
    • Includes the InternVL example as a showcase.
  • 📘 DJ-Doc Redesign (#675):

    • Now with multilingual support (English / Chinese) and a modernized style.
  • 📦 Dependency Management Update (#660, #680):

    • Migrated to uv for faster dependency resolution.
    • Added sub-groups for better organization.

🌍 New Features & Integrations (#683, #688, #692)

  • 🆕 Additional Repo Supported:

  • 📜 DJ-Awesome-List:

    • A survey paper accepted by TPAMI'25!
  • 🧪 Synthetic Benchmark Added:

    • DetailMaster – a new benchmark for synthetic data evaluation.
  • 🛠️ New Operators Introduced (#673, #701):

    • llm_analysis_filter
    • general_field_filter

🚀 Core Optimizations & Bug Fixes

  • Ray Executor Enhancements (#697):

    • File extension detection added.
    • Support for more data formats.
  • ⏱️ Startup Time Optimization:

    • Improved startup performance. (#684)
  • 🧠 Text Embedding Support:

    • Added support for text embedding via API and local model. (#681)
  • 🐳 Docker Build Improvement:

    • Ignore installed distutils libraries during Docker image building. (#668)
  • 🛠️ Mapper Module Fix:

    • Fixed issue with module initialization. (#700)
  • 🗑️ Warning Suppression:

    • Suppressed unnecessary warnings from fasttext. (#696)

📚 Full Changelog

View all changes since v1.3.3 →

Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.

09 May 10:20
444537e

Major Updates

  • 🎉 Our work on Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
  • Add new OPs and recipes for Img-Diff. #658

Enhancements

  • Support Hugging Face LLMs for the two llm_xxx_score_filter OPs. #655
  • Sync the docker image to Aliyun OSS for downloading when Docker Hub is not accessible. #657
  • Split standalone and distributed unit tests to save time when re-running failed ones. #666

Bugs Fixed

  • Address possibly missing cfg in unify_format. #653
  • Improve clarity & fix bad links for some docs. #659

Acknowledgement

Full Changelog: v1.3.2...v1.3.3

Release v1.3.2: Enhancements on usability & two OPs; some bug fixes

25 Apr 11:17
2172698

What's Changed

  • Human OP enhancements, in #642 #645
    • update label-studio version
    • make service script more robust
    • add documentation
    • optimizing fields mapping
  • OP efficiency optimization of document_minhash_deduplicator, in #639
  • set temp_parser.usage to argparse.SUPPRESS to avoid printing excessive help logs, in #643
  • fix date typo in #648
  • Fix docker building failure in #650
  • Fix StreamToLoguru compatibility issue with torch._dynamo in #651
  • add init file for annotation module, fix dj-process command error in #652
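
The `argparse.SUPPRESS` item above relies on standard-library behavior, which the sketch below demonstrates on a standalone parser. The parser here is only a stand-in: the `--config` argument and the `demo.yaml` value are assumptions for illustration, not Data-Juicer's actual temporary parser.

```python
import argparse

# A stand-in for a temporary parser used only to pre-read a few arguments.
temp_parser = argparse.ArgumentParser(add_help=False)
temp_parser.add_argument("--config")

# With usage set to SUPPRESS, argparse emits no usage line for this parser.
temp_parser.usage = argparse.SUPPRESS
usage_text = temp_parser.format_usage()

# Parsing still works as usual; unknown arguments are returned separately.
args, unknown = temp_parser.parse_known_args(["--config", "demo.yaml", "--np", "4"])
```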

New Contributor

Release v1.3.1: added HumanOPs & fixed some bugs

11 Apr 09:48
e90a759

Major Updates

  • 💥 Prototype implementation for HumanOps (annotation). #617 Included features:
    • boilerplate code for supporting label studio powered human annotation ops
    • a human preference annotation reference implementation is provided
    • label studio service script; can start up local instance using docker or pip, whichever is available
    • reference configs and data
    • event driven and notification mixins framework for ops

New OPs

  • extract_tables_from_html_mapper: extract tables from html texts. #634
  • general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
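
To illustrate the idea of explicit fusion described for general_fused_op, the sketch below chains several batch-level functions so that each batch passes through all of them in one call. The `fuse` helper and the two toy operations are hypothetical and do not reflect the OP's real interface or configuration.

```python
def strip_text(batch):
    """Toy OP: trim surrounding whitespace from every text in the batch."""
    return {**batch, "text": [t.strip() for t in batch["text"]]}


def lowercase_text(batch):
    """Toy OP: lowercase every text in the batch."""
    return {**batch, "text": [t.lower() for t in batch["text"]]}


def fuse(*ops):
    """Return one callable that applies the given OPs in order to the same batch."""
    def fused(batch):
        for op in ops:
            batch = op(batch)
        return batch
    return fused


fused_op = fuse(strip_text, lowercase_text)
result = fused_op({"text": ["  Hello ", "WORLD"]})
```

Running the OPs back to back on one batch avoids materializing the whole dataset between steps, which is the kind of fine-grained control the release note refers to.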

Bugs Fixed

  • fix dataset builder initialization failure #630
  • update Executor references from Executor to DefaultExecutor #632 #633
  • switch the backend of plt to avoid sub-process/thread error #633
  • fix some boundary condition bugs in several deduplicators #635 #637

Others

  • check the dataset when loading so that a dataset can be passed to the DefaultExecutor.run method. #633
  • update docs to highlight light env installation part. #636

Acknowledgement

Full Changelog: v1.3.0...v1.3.1

Release v1.3.0: Refactor of dataset builder and executor!

28 Mar 12:08
1b9afd1

The Big Change 🚀

Refactor of dataset builder and executor, see #537, @cyruszhang
📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
🔌 Remove the Executor's hardcoded format support: it is no longer restricted to local JSON formats; the input format is determined dynamically via formatters/downloaders.
🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
🔍 Add data format validation to ensure consistency and correctness.
🌐 Expanded data source support:
a. 📦 ModelScope integration.
b. 📚 ArXiv dataset import (download, decompress, ingest).
c. 📚 Wikipedia dataset support (download, decompress, ingest).
d. 🌐 Common Crawl integration (download, decompress, ingest).
🔗 Backward compatibility with existing dataset_path command-line syntax.
🔀 Support for data mixtures to combine multiple datasets dynamically.
🔧 Support for empty formatters/generated datasets without pre-defined config files.

Others 💡

🔊 New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
📊 Added dynamic coverage rate badge to the README for transparency (PR #625)

Release v1.2.2

14 Mar 09:58
8d09410

Major Updates

  • 🧪 Add documentation for the API service. Add parameter transmission using json.dumps to support API calls for arbitrary registered functions and classes. #613
  • 🚀 Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
  • A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.
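
The parameter transmission via json.dumps mentioned in the API service item above can be pictured with the following round-trip sketch: each value is JSON-encoded before being placed in a query string and decoded again on the receiving side, so lists and numbers keep their types. The helper names and the example parameters are assumptions; the real API service may encode requests differently.

```python
import json
from urllib.parse import parse_qs, urlencode


def encode_params(params):
    """Client side: JSON-encode each value, then build a URL query string."""
    return urlencode({key: json.dumps(value) for key, value in params.items()})


def decode_params(query):
    """Server side: parse the query string and JSON-decode each value."""
    return {key: json.loads(values[0]) for key, values in parse_qs(query).items()}


# Hypothetical parameters for some registered function.
params = {"lang": "en", "min_len": 10, "fields": ["text", "meta"]}
query = encode_params(params)
restored = decode_params(query)
```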

New OPs

  • llm_quality_score_filter: Filter to keep samples with high quality scores estimated by an LLM, supporting both API calls and local vLLM calls. #606 #614 #620
  • llm_difficulty_score_filter: Filter to keep samples with high difficulty scores estimated by an LLM, supporting both API calls and local vLLM calls. #606 #614 #620

Others

  • Fix config in LLaVa pretrain recipe. #610
  • Update news for MindGYM and fix doc. #615
  • Fix decode error through UTF-8 decoding. #618

Release v1.2.1

28 Feb 07:50
6014bcc

Major Updates

  • DJ has been integrated into Ray's official Ecosystem and Example Gallery. In addition, our patch in DJ2.0 for the streaming JSON reader has been officially integrated by Apache Arrow.
  • Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR 2025!
  • Unit test optimization:
    • split unit tests into partial and regression tests: the partial test is triggered by a PR and only runs the test cases corresponding to the changed files; the regression test runs all cases and is triggered at 7:00 every Friday, Beijing time. #598
    • use the native @unittest.skip decorator and remove SKIPPED_TESTS. #586
    • upload test coverage reports to GitHub artifacts. #586
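
One way to picture the partial test selection described above is the sketch below, which maps changed files to test files. It assumes a `test_<module>.py` naming convention and uses made-up paths; the actual selection logic in the CI scripts may differ.

```python
from pathlib import PurePosixPath


def select_partial_tests(changed_files, available_tests):
    """Pick the test files that correspond to the changed Python files."""
    available = set(available_tests)
    selected = set()
    for path in changed_files:
        parsed = PurePosixPath(path)
        if parsed.suffix != ".py":
            continue  # non-Python changes trigger no unit tests in this sketch
        if path in available:
            selected.add(path)  # a changed test file is re-run directly
            continue
        candidate = f"test_{parsed.stem}.py"
        selected.update(t for t in available if PurePosixPath(t).name == candidate)
    return sorted(selected)


# Hypothetical paths for illustration only.
changed = [
    "data_juicer/ops/filter/text_length_filter.py",
    "README.md",
    "tests/ops/test_op_fusion.py",
]
tests = [
    "tests/ops/filter/test_text_length_filter.py",
    "tests/ops/test_op_fusion.py",
    "tests/ops/mapper/test_clean_html_mapper.py",
]
to_run = select_partial_tests(changed, tests)
```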

New OPs

  • image_remove_background_mapper: remove the background of images. #589

Others

  • add missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585
  • only build doc for py3.10. #586
  • move dependency on ray to minimal requirements. #586 #594 #595
  • allow executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597
  • fix undefined fileno bug of the logger. #594

Acknowledgement

v1.2.0 Doc refactored; New algorithm proposed

14 Feb 09:40
7820a4d

What's New

Detailed PRs

  • fix export error when export_stats columns is null in #557
  • Resplit input dataset in ray mode in #549
  • Refactor and improve doc for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in #561
  • Resolve most skipped unit-tests in #559
  • fix translation error in #562
  • Add unittest for ray text dedup in #540
  • [Typo]correct a small typo in #563
  • update the 2.0 paper link & the DaaR news in #566
  • Fix typos in #571
  • Optimization for sdxl_prompt2prompt_mapper dependency importing in #570
  • Fix typos in #572

Acknowledgment

Full Changelog: v1.1.0...v1.2.0

Release v1.1.0

17 Jan 09:46
030e786

Major Updates

  • 🧪 Users can now run ray-based distributed data processing under the guidance of the added docs. #523
  • 🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
  • 💥 Change Task mode to Actor mode for ray deduplication, allowing users to use these operators without installing Redis. #526
  • 🚀 Append a log summary of warnings and errors at the end of a run to make them recognizable under the sample fault tolerance mechanism. #534
  • 🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
  • 🛝 Add usability tags for OPs:
    • alpha tag for OPs in which only the basic OP implementations are finished;
    • beta tag for OPs in which unittests are added based on the alpha version;
    • stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.
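
The three usability tags build on one another, which the following sketch expresses as a small decision function. The function and its two boolean inputs are hypothetical simplifications of the criteria listed above.

```python
def usability_tag(has_unittests, has_dj_optimizations):
    """Derive the usability tag from how far an OP has progressed.

    alpha:  only the basic implementation is finished.
    beta:   unit tests are added on top of the alpha version.
    stable: DJ-related optimizations are added on top of the beta version.
    """
    if has_unittests and has_dj_optimizations:
        return "stable"
    if has_unittests:
        return "beta"
    return "alpha"


tags = [
    usability_tag(False, False),
    usability_tag(True, False),
    usability_tag(True, True),
]
```

Note that in this sketch an OP with optimizations but no unit tests remains alpha, since stable is defined as an extension of beta.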

New OPs

  • image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
  • mllm_mapper: Mapper to use MLLMs to generate texts for images. #550
  • sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
  • sentence_augmentation_mapper: Augment sentences using LLMs. #550
  • text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550

Bugs Fixed

  • Add a global skip_op_error param to enable fault tolerance when executing the Data-Juicer analyzer and executor, while disabling fault tolerance for unit tests. #528
  • Fix model force download bug. #529
  • Fix IndexError if the number of samples in the result dataset is less than the number of workers when saving dataset to disk. #536
  • Fix missing field meta tag on ray mode. #538
  • Update max_tokens or max_new_tokens for vllm-based OPs to avoid overly short generations. #544
  • Fix bug in the role playing data generation demo. #545
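
The effect of a flag like skip_op_error can be sketched as follows: with the flag on, a failing sample is recorded and skipped; with it off, the error propagates, which is what unit tests want. The `run_op` helper is a hypothetical illustration and not the executor's real code path.

```python
def run_op(op, samples, skip_op_error=True):
    """Apply op to each sample, optionally tolerating per-sample failures."""
    kept, errors = [], []
    for sample in samples:
        try:
            kept.append(op(sample))
        except Exception as exc:
            if not skip_op_error:
                raise  # strict mode: surface the failure immediately
            errors.append(f"{type(exc).__name__}: {exc}")
    return kept, errors


def uppercase(sample):
    """Toy OP that fails on samples without a 'text' field."""
    return {"text": sample["text"].upper()}


samples = [{"text": "a"}, {"no_text": 1}, {"text": "b"}]
kept, errors = run_op(uppercase, samples, skip_op_error=True)
```

The collected `errors` list plays the role of the material that a log summary at the end of a run could report.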

Others

  • Enhance unit test for API calling OPs. #528
  • Remove sandbox requirements installation from Dockerfile. #530
  • Update the datasource related APIs to be compatible with the latest version of Ray. #532
  • Limit the generated qa num for each text in generate_qa_from_text_mapper. #541
  • Update docs for preparing DJ2.0 release. #542
  • Update a quick cdn link for arch figure. #543
  • Add a video demo for role playing data generation. #545
  • Optimize op doc for global textual search. #552
  • Use a translator that is more stable and faster than Google Translate for automatic OP doc building. #554

Acknowledgement

  • @Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550