<p align="center">
  <img alt="Hugging Face x Spark" src="https://pbs.twimg.com/media/FvN1b_2XwAAWI1H?format=jpg&name=large" width="352" style="max-width: 100%;">
  <br/>
  <br/>
</p>

<p align="center">
  <a href="https://github.com/huggingface/pyspark_huggingface/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/pyspark_huggingface.svg"></a>
  <a href="https://huggingface.co/datasets/"><img alt="Number of datasets" src="https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen"></a>
</p>

# Spark Data Source for Hugging Face Datasets

A Spark Data Source for accessing [🤗 Hugging Face Datasets](https://huggingface.co/datasets):

- Stream datasets from Hugging Face as Spark DataFrames
- Select subsets and splits, apply projection and predicate filters
- Save Spark DataFrames as Parquet files to Hugging Face
- Fully distributed
- Authentication via `huggingface-cli login` or tokens
- Compatible with Spark 4 (with auto-import)
- Backport for Spark 3.5, 3.4 and 3.3

## Installation

```
pip install pyspark_huggingface
```

## Usage

Load a dataset (here [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)):

```python
df = spark.read.format("huggingface").load("stanfordnlp/imdb")
```

Save to Hugging Face:

```python
# Login with `huggingface-cli login`, then:
df.write.format("huggingface").save("username/my_dataset")
# Or pass a token manually:
df.write.format("huggingface").option("token", "hf_xxx").save("username/my_dataset")
```
## Advanced

Select a split:

```python
test_df = (
    spark.read.format("huggingface")
    .option("split", "test")
    .load("stanfordnlp/imdb")
)
```

Select a subset/config:

```python
test_df = (
    spark.read.format("huggingface")
    .option("config", "sample-10BT")
    .load("HuggingFaceFW/fineweb-edu")
)
```

Filter columns and rows (especially efficient for Parquet datasets):

```python
df = (
    spark.read.format("huggingface")
    .option("filters", '[("language_score", ">", 0.99)]')
    .option("columns", '["text", "language_score"]')
    .load("HuggingFaceFW/fineweb-edu")
)
```
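
The `filters` option is a string holding a list of `(column, operator, value)` tuples. For filters built at runtime, one way to produce that string is `str()` on a Python list of tuples; this is a minimal sketch, and whether single-quoted output is accepted in addition to the double-quoted form shown above is an assumption about how the option is parsed:

```python
import ast

# Assumed (column, operator, value) tuple format, following the example above.
filters = [("language_score", ">", 0.99), ("text", "!=", "")]
filters_str = str(filters)

# Sanity check: the string round-trips back to the same tuples.
assert ast.literal_eval(filters_str) == filters
```

The resulting string can then be passed as `.option("filters", filters_str)`.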

## Backport

While the Data Source API was introduced in Spark 4, this package includes a backport for older versions.

Importing `pyspark_huggingface` patches the PySpark reader and writer to add the "huggingface" data source. It is compatible with PySpark 3.5, 3.4 and 3.3:

```python
>>> import pyspark_huggingface
huggingface datasource enabled for pyspark 3.x.x (backport from pyspark 4)
```

The import is only necessary on Spark 3.x to enable the backport.
Spark 4 automatically imports `pyspark_huggingface` as soon as it is installed, and registers the "huggingface" data source.
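
For scripts that must run on both Spark 3.x and Spark 4, the import can be guarded on the PySpark version. A minimal sketch (the helper name `needs_backport` is hypothetical, not part of this package):

```python
def needs_backport(spark_version: str) -> bool:
    """Return True on Spark 3.x, where importing pyspark_huggingface
    is required to register the "huggingface" data source."""
    return int(spark_version.split(".")[0]) < 4

# Usage (assumes pyspark is installed):
# import pyspark
# if needs_backport(pyspark.__version__):
#     import pyspark_huggingface
```

Importing unconditionally is also harmless, since Spark 4 simply auto-imports the package anyway.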