Add more features to huggingface reader #3
Conversation
def partitions(self) -> Sequence[InputPartition]:
    from datasets import load_dataset
    if not self.streaming:
        return [Shard(index=0)]
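For context, a rough sketch of what the streaming branch of this method could look like, based on the PR description further down. The attribute names (self.path, self.config_name, self.split) and the exact shard-count property on IterableDataset (n_shards vs. num_shards, depending on the datasets version) are assumptions here, not necessarily what the PR uses:

def partitions(self) -> Sequence[InputPartition]:
    from datasets import load_dataset
    if not self.streaming:
        # Non-streaming: a single partition for now (discussed below)
        return [Shard(index=0)]
    # Streaming: one Spark input partition per dataset shard
    ds = load_dataset(self.path, name=self.config_name, split=self.split, streaming=True)
    return [Shard(index=i) for i in range(ds.n_shards)]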
@lhoestq I am not able to get num_shards for a non-streaming dataset. Do you know if this is supported?
When the dataset is available locally it is loaded as a Dataset backed by a memory-mapped Arrow Table that is the concatenation of all the shards. So in practice you don't really care about the shards themselves, since you can take whatever slice of the Table you want. The number of shards can be set to the maximum level of parallelism of the Spark setup, or we can decide to have as many shards as cached Arrow files, or as many shards as Arrow Record Batches, for example.
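A rough sketch of the slicing idea described above, splitting the memory-mapped Dataset into contiguous row ranges (RowRangePartition and target_partitions are hypothetical names used only for illustration, not part of this PR):

from dataclasses import dataclass
from datasets import load_dataset

@dataclass
class RowRangePartition:
    start: int
    end: int

def row_range_partitions(path, split, target_partitions):
    # Loads (or reuses from the cache) the memory-mapped, Arrow-backed Dataset
    ds = load_dataset(path, split=split, streaming=False)
    step = max(1, len(ds) // target_partitions)
    return [RowRangePartition(i, min(i + step, len(ds))) for i in range(0, len(ds), step)]

# Each Spark task would then read only its own slice,
# e.g. ds.select(range(p.start, p.end)) for its partition p.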
Here load_dataset(..., streaming=False) downloads the full dataset and prepares it as Arrow files locally, so it must be called only once. I understand that in the current implementation it would be called once, since the number of partitions is set to 1, so it works, but it doesn't leverage Spark's distributed execution.
There is an experimental feature that was added a while ago to let load_dataset run in parallel using Spark via joblibspark: huggingface/datasets#5924
with parallel_backend('spark') as backend:
    ds = load_dataset(..., streaming=False, num_proc=<number of spark jobs to spawn>)  # returns directly if the dataset is cached

It's also possible to get the Dataset from the downloaded and prepared Arrow dataset like this:
builder = load_dataset_builder(...)
with parallel_backend('spark') as backend:
    builder.download_and_prepare(..., num_proc=...)  # returns directly if the dataset is cached
ds = builder.as_dataset(split)
# EDIT: it should be possible to get an IterableDataset as well but I need to double check

If this doesn't fit the current implementation well, we can keep it for later and call the internals of the builder manually in proper Spark code if needed.
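A more complete, runnable sketch of the joblibspark approach above; the dataset name and num_proc value are placeholders, and it assumes joblibspark is installed and a Spark session is available:

from datasets import load_dataset
from joblib import parallel_backend
from joblibspark import register_spark

register_spark()  # registers the "spark" joblib backend

with parallel_backend("spark") as backend:
    # download and prepare the shards as Spark jobs; returns directly if the dataset is cached
    ds = load_dataset("username/some_dataset", streaming=False, num_proc=8)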
Sounds good. I can try it out using Spark cluster mode instead of the local mode to see if streaming works better.
lhoestq left a comment:
Looking good! We can see if we want to parallelize the non-streaming case later anyway :)
This PR:
- Adds a partitions method in DataSourceReader that uses the num_shards parameter from the IterableDataset to read from a streaming dataset across multiple workers.
- Supports non-streaming datasets via load_dataset(..., streaming=False).
- Supports dataset config names via load_dataset(path, name=config_name, ...).
- Updates the read method to use an iterator of Arrow batches. Note this can only be tested against the Spark master branch build (not with the Spark 4.0.0.dev2 release).

Example
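A minimal usage sketch of the reader; the data source class name, the "huggingface" format name, and the option keys below are assumptions for illustration and may not match this repo exactly:

from pyspark.sql import SparkSession
from pyspark_huggingface import HuggingFaceDatasets  # assumed module/class name

spark = SparkSession.builder.getOrCreate()

# Register the Python data source (requires the Spark 4 Python Data Source API)
spark.dataSource.register(HuggingFaceDatasets)

# Read a dataset; the config option maps to load_dataset(path, name=config_name, ...)
df = (
    spark.read.format("huggingface")
    .option("config", "some_config")  # assumed option key
    .load("rotten_tomatoes")
)
df.show()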