[SPARK-25347][ML][DOC] Spark datasource for image/libsvm user guide

WeichenXu123 · cloud-fan · commit a20660b6fcee · 2018-10-25T23:04:06.000+08:00
## What changes were proposed in this pull request?

Spark datasource for image/libsvm user guide

## How was this patch tested?

Scala: <img width="1022" alt="1" src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png">

Java: <img width="1019" alt="2" src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png">

Python: <img width="1022" alt="3" src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png">

R: <img width="1024" alt="4" src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png">

Closes #22675 from WeichenXu123/add_image_source_doc.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6540c2f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml
@@ -1,5 +1,7 @@
 - text: Basic statistics
   url: ml-statistics.html
+- text: Data sources
+  url: ml-datasource.html
 - text: Pipelines
   url: ml-pipeline.html
 - text: Extracting, transforming and selecting features
diff --git a/docs/ml-datasource.md b/docs/ml-datasource.md
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Data sources
+displayTitle: Data sources
+---
+
+In this section, we introduce how to use data sources in ML to load data.
+Besides general-purpose data sources such as Parquet, CSV, JSON and JDBC, we also provide some data sources specific to ML.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+## Image data source
+
+This image data source is used to load image files from a directory. It can load compressed images (jpeg, png, etc.) into a raw image representation via `ImageIO` in the Java library.
+The loaded DataFrame has one `StructType` column, `image`, containing image data stored according to the image schema.
+The schema of the `image` column is:
+ - origin: `StringType` (represents the file path of the image)
+ - height: `IntegerType` (height of the image)
+ - width: `IntegerType` (width of the image)
+ - nChannels: `IntegerType` (number of image channels)
+ - mode: `IntegerType` (OpenCV-compatible type)
+ - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
+
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource)
+implements a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight scala %}
+scala> val df = spark.read.format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens")
+df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>]
+
+scala> df.select("image.origin", "image.width", "image.height").show(truncate=false)
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 |width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
++-----------------------------------------------------------------------+-----+------+
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
+implements a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight java %}
+Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
+imagesDF.select("image.origin", "image.width", "image.height").show(false);
+/*
+Will output:
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 |width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
++-----------------------------------------------------------------------+-----+------+
+*/
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+In PySpark we provide a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight python %}
+>>> df = spark.read.format("image").option("dropInvalid", True).load("data/mllib/images/origin/kittens")
+>>> df.select("image.origin", "image.width", "image.height").show(truncate=False)
++-----------------------------------------------------------------------+-----+------+
+|origin                                                                 |width|height|
++-----------------------------------------------------------------------+-----+------+
+|file:///spark/data/mllib/images/origin/kittens/54893.jpg               |300  |311   |
+|file:///spark/data/mllib/images/origin/kittens/DP802813.jpg            |199  |313   |
+|file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg |300  |200   |
+|file:///spark/data/mllib/images/origin/kittens/DP153539.jpg            |300  |296   |
++-----------------------------------------------------------------------+-----+------+
+{% endhighlight %}
+</div>
+
+<div data-lang="r" markdown="1">
+In SparkR we provide a Spark SQL data source API for loading image data as a DataFrame.
+
+{% highlight r %}
+> df = read.df("data/mllib/images/origin/kittens", "image")
+> head(select(df, df$image.origin, df$image.width, df$image.height))
+
+1               file:///spark/data/mllib/images/origin/kittens/54893.jpg
+2            file:///spark/data/mllib/images/origin/kittens/DP802813.jpg
+3 file:///spark/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg
+4            file:///spark/data/mllib/images/origin/kittens/DP153539.jpg
+  width height
+1   300    311
+2   199    313
+3   300    200
+4   300    296
+
+{% endhighlight %}
+</div>
+
+
+</div>
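
Note on the `image` schema documented in `docs/ml-datasource.md`: the `data` field is described as image bytes in OpenCV-compatible order, row-wise BGR in most cases. The sketch below illustrates one way to read a single pixel out of such a buffer in plain Python. It assumes 8-bit channels interleaved per pixel, and the helper names `pixel_at` and `bgr_to_rgb` and the 2x2 sample buffer are hypothetical, made up for illustration; they are not part of the Spark API.

```python
def pixel_at(data, width, n_channels, row, col):
    """Return the channel values of the pixel at (row, col) as a tuple.

    Assumes row-wise storage with n_channels interleaved bytes per pixel,
    so for a 3-channel image the tuple is (blue, green, red).
    """
    offset = (row * width + col) * n_channels
    return tuple(data[offset:offset + n_channels])


def bgr_to_rgb(pixel):
    """Reorder a (blue, green, red) tuple into (red, green, blue)."""
    b, g, r = pixel[:3]
    return (r, g, b)


# Hypothetical 2x2 image with 3 channels, laid out the way the `data`
# field is described: one row after another, each pixel as B, G, R.
height, width, n_channels = 2, 2, 3
data = bytes([
    255, 0, 0,    0, 255, 0,       # row 0: pure blue, pure green
    0, 0, 255,    255, 255, 255,   # row 1: pure red, white
])

# The buffer length should match the dimensions stored in the schema.
assert len(data) == height * width * n_channels

top_left = pixel_at(data, width, n_channels, 0, 0)     # (255, 0, 0) in BGR
bottom_left = pixel_at(data, width, n_channels, 1, 0)  # (0, 0, 255) in BGR
print(top_left, bgr_to_rgb(bottom_left))
```

In practice the `height`, `width`, `nChannels` and `data` values would come from a row of the loaded DataFrame rather than from a literal buffer, and the `mode` field should be checked before assuming a three-channel layout.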
diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
@@ -19,14 +19,17 @@ package org.apache.spark.ml.source.image
 
 /**
  * `image` package implements Spark SQL data source API for loading image data as `DataFrame`.
- * The loaded `DataFrame` has one `StructType` column: `image`.
+ * It can load compressed images (jpeg, png, etc.) into a raw image representation via
+ * `ImageIO` in the Java library.
+ * The loaded `DataFrame` has one `StructType` column: `image`, containing image data stored
+ * according to the image schema.
  * The schema of the `image` column is:
- *  - origin: String (represents the file path of the image)
- *  - height: Int (height of the image)
- *  - width: Int (width of the image)
- *  - nChannels: Int (number of the image channels)
- *  - mode: Int (OpenCV-compatible type)
- *  - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
+ *  - origin: `StringType` (represents the file path of the image)
+ *  - height: `IntegerType` (height of the image)
+ *  - width: `IntegerType` (width of the image)
+ *  - nChannels: `IntegerType` (number of image channels)
+ *  - mode: `IntegerType` (OpenCV-compatible type)
+ *  - data: `BinaryType` (Image bytes in OpenCV-compatible order: row-wise BGR in most cases)
  *
  * To use image data source, you need to set "image" as the format in `DataFrameReader` and
  * optionally specify the data source options, for example: