
Commit a20660b

WeichenXu123 authored and cloud-fan committed
[SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide
## What changes were proposed in this pull request?

Spark datasource for image/libsvm user guide

## How was this patch tested?

Scala:
<img width="1022" alt="1" src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png">

Java:
<img width="1019" alt="2" src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png">

Python:
<img width="1022" alt="3" src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png">

R:
<img width="1024" alt="4" src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png">

Closes #22675 from WeichenXu123/add_image_source_doc.

Authored-by: WeichenXu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

(cherry picked from commit 6540c2f)
Signed-off-by: Wenchen Fan <[email protected]>
1 parent d5e6948 commit a20660b

3 files changed: +120 -7 lines changed

docs/_data/menu-ml.yaml

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 - text: Basic statistics
   url: ml-statistics.html
+- text: Data sources
+  url: ml-datasource
 - text: Pipelines
   url: ml-pipeline.html
 - text: Extracting, transforming and selecting features

docs/ml-datasource.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
+---
+
+In this section, we introduce how to use data sources in ML to load data.
+Besides general data sources such as Parquet, CSV, JSON and JDBC, we also provide some specific data sources for ML.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## Image data source
+
+This image data source is used to load image files from a directory. It can load compressed images (jpeg, png, etc.) into a raw image representation via `ImageIO` in the Java library.
+The loaded DataFrame has one `StructType` column, `image`, containing image data stored in the image schema.
+The schema of the `image` column is:
+ - origin: `StringType` (represents the file path of the image)
+ - height: `IntegerType` (height of the image)
+ - width: `IntegerType` (width of the image)
+ - nChannels: `IntegerType` (number of image channels)
+ - mode: `IntegerType` (OpenCV-compatible type)
+ - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
+
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource)
+implements a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight scala %}
+scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
+df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]
+
+scala> df.select("image.origin", "image.width", "image.height").show(truncate=false)
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 |width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
++-----------------------------------------------------------------------+-----+------+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
+implements a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight java %}
+Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
+imagesDF.select("image.origin", "image.width", "image.height").show(false);
+/*
+Will output:
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 |width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
++-----------------------------------------------------------------------+-----+------+
+*/
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+In PySpark, we provide a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight python %}
+>>> df = spark.read.format("image").option("dropInvalid", True).load("data/mllib/images/origin/kittens")
+>>> df.select("image.origin", "image.width", "image.height").show(truncate=False)
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 |width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
++-----------------------------------------------------------------------+-----+------+
+{% endhighlight %}
+</div>
+
+<div data-lang="r" markdown="1">
+In SparkR, we provide a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight r %}
+> df = read.df("data/mllib/images/origin/kittens", "image")
+> head(select(df, df$image.origin, df$image.width, df$image.height))
+
+1               file:///spark/data/mllib/images/origin/kittens/54893.jpg
+2            file:///spark/data/mllib/images/origin/kittens/DP802813.jpg
+3 file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg
+4            file:///spark/data/mllib/images/origin/kittens/DP153539.jpg
+  width height
+1   300    311
+2   199    313
+3   300    200
+4   300    296
+
+{% endhighlight %}
+</div>
+
+
+</div>
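
Note on the `data` field: the schema in this guide says the bytes are stored in OpenCV-compatible order, row-wise BGR in most cases. A minimal sketch of what that layout implies for reading one pixel is given here. It assumes one byte per channel and no padding between rows, neither of which is stated in the commit. `pixel_at` and the sample bytes are hypothetical names and values for illustration, and the snippet does not use Spark.

```python
def pixel_at(data: bytes, width: int, n_channels: int, row: int, col: int) -> tuple:
    """Return the channel values of one pixel from row-wise interleaved bytes.

    Assumes one byte per channel and no padding between rows.
    """
    start = (row * width + col) * n_channels
    return tuple(data[start:start + n_channels])


# Hypothetical 2x2 image with 3 channels, stored as B, G, R for each pixel.
height, width, n_channels = 2, 2, 3
data = bytes([
    255, 0, 0,    0, 255, 0,     # row 0: blue pixel, green pixel
    0, 0, 255,    10, 20, 30,    # row 1: red pixel, arbitrary pixel
])

# Read the pixel at row 1, column 0 and reorder BGR to RGB.
b, g, r = pixel_at(data, width, n_channels, row=1, col=0)
rgb = (r, g, b)
```

With a real row collected from the DataFrame, the same indexing would apply to the `image.data` bytes together with `image.width` and `image.nChannels`, provided the assumptions above hold for that image.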

mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala

Lines changed: 10 additions & 7 deletions
@@ -19,14 +19,17 @@ package org.apache.spark.ml.source.image
 
 /**
  * `image` package implements Spark SQL data source API for loading image data as `DataFrame`.
- * The loaded `DataFrame` has one `StructType` column: `image`.
+ * It can load compressed image (jpeg, png, etc.) into raw image representation via `ImageIO`
+ * in Java library.
+ * The loaded `DataFrame` has one `StructType` column: `image`, containing image data stored
+ * as image schema.
  * The schema of the `image` column is:
- *  - origin: String (represents the file path of the image)
- *  - height: Int (height of the image)
- *  - width: Int (width of the image)
- *  - nChannels: Int (number of the image channels)
- *  - mode: Int (OpenCV-compatible type)
- *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
+ *  - origin: `StringType` (represents the file path of the image)
+ *  - height: `IntegerType` (height of the image)
+ *  - width: `IntegerType` (width of the image)
+ *  - nChannels: `IntegerType` (number of image channels)
+ *  - mode: `IntegerType` (OpenCV-compatible type)
+ *  - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
  *
  * To use image data source, you need to set "image" as the format in `DataFrameReader` and
  * optionally specify the data source options, for example:
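
Note on the `mode` field: the doc comment in this hunk describes it only as an "OpenCV-compatible type". OpenCV packs a matrix type into one integer as the depth code plus the channel count minus one, shifted left by three bits. The sketch below shows that arithmetic for 8-bit unsigned images. The helper names `ocv_type` and `channels_of` are hypothetical, and whether this data source only ever produces 8-bit types is an assumption, not something the diff states.

```python
CV_8U = 0        # OpenCV depth code for unsigned 8-bit values
CV_CN_SHIFT = 3  # the channel count is stored above the three depth bits


def ocv_type(depth: int, channels: int) -> int:
    """Combine a depth code and a channel count into an OpenCV type code."""
    return depth + ((channels - 1) << CV_CN_SHIFT)


def channels_of(mode: int) -> int:
    """Recover the channel count from an OpenCV type code."""
    return (mode >> CV_CN_SHIFT) + 1


# Type codes for 1-, 3- and 4-channel 8-bit images.
modes = {n: ocv_type(CV_8U, n) for n in (1, 3, 4)}
```

Under these assumptions, a row whose `nChannels` is 3 would be expected to carry the `mode` value computed for three channels, which gives a cheap consistency check between the two fields.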
