Skip to content

Conversation

@WeichenXu123
Copy link
Contributor

@WeichenXu123 WeichenXu123 commented Oct 9, 2018

What changes were proposed in this pull request?

Spark datasource for image/libsvm user guide

How was this patch tested?

Scala:
1

Java:
2

Python:
3

R:
4

@SparkQA
Copy link

SparkQA commented Oct 9, 2018

Test build #97141 has finished for PR 22675 at commit 887f528.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why did we put the image source inside of Spark, rather then a separate module? (see also #21742 (comment)). Avro was put as a separate module.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @mengxr as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually it depends on how important the use case is. For example, CSV was created as an external data source and later merged into Spark. See https://issues.apache.org/jira/browse/SPARK-21866?focusedCommentId=16148268&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16148268.

Copy link
Member

@HyukjinKwon HyukjinKwon Oct 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant (external) Avro was merged into external/... in Apache Spark as a separate module due to the reason above. Image data source is merged into Spark's main code rather then a separate module. I don't object to bring an external into Apache Spark and I don't doubt you guys's judgement - ++1 for bring this in actually.

My point is I was wondering why this exists in Spark's main code whereas the ideal approach is to put them external/....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @cloud-fan and @gatorsmile, am I missing something?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sympathize with the comment, but I think it makes some sense tucked into ML rather than a standalone module.

@felixcheung
Copy link
Member

we need this for 2.4?

@WeichenXu123
Copy link
Contributor Author

This do not block 2.4 release. But merge before 2.4 is better.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs some proofreading but the idea seems OK

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • data sources -> data sources like
  • Do you want to just say "Parquet, CSV, JSON, and JDBC"? they aren't code identifiers here
  • parquat -> parquet
  • data source -> data sources

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from directory -> from a directory
Maybe say "This data source loads images in libsvm format from a directory"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implements Spark -> implements a Spark
as DataFrame -> as a DataFrame
This sentence is repeated three times. Can you move the shared text out of the language-specific code blocks?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scala -> Scala, but this is about Python.

@SparkQA
Copy link

SparkQA commented Oct 10, 2018

Test build #97184 has finished for PR 22675 at commit cc4c967.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 10, 2018

Test build #97185 has finished for PR 22675 at commit 994d09d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON, JDBC -> JSON and JDBC

Copy link
Member

@HyukjinKwon HyukjinKwon Oct 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we describe which image we can load? For instance, I think this delegates to ImageIO in Java which allows to read compressed format like PNG or JPG to raw image representation like BMP so that OpenCV can handles them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also describe the schema structure and what each field means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about SQL syntax? I think we can use CREATE TABLE tableA USING LOCATION 'data/image.png'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like SQL features and fit all datasources. Put it in spark SQL doc will be better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add an example for R as well then? It wouldn't be too difficult to add the equivalent examples. Also, I don't think we will add the equivalent examples in different languages at different pages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do a simple transformation so that how the image datasource can be utilized?

@SparkQA
Copy link

SparkQA commented Oct 18, 2018

Test build #97514 has finished for PR 22675 at commit ba0d888.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use SQL types consistently, for instance, StringType, IntegerType

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. -> ,.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be Datasource or Data sources? I am saying this because there looks a mismatch with the menu above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data sources.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really personal preference tho .. like -> such as

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we consistently make some codes such as StructType as codes like `StructType`

@HyukjinKwon
Copy link
Member

Looks cool otherwise!

@SparkQA
Copy link

SparkQA commented Oct 18, 2018

Test build #97541 has finished for PR 22675 at commit 1b54044.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"some specific data sources for ML"

@imatiach-msft
Copy link
Contributor

@WeichenXu123 this looks good - one recommendation though, in the examples it might be good to pass the param that ignores bad images so that the not-image.txt is not included in the output, it might confuse people who are new to the API - unless you feel strongly about having the not-image.txt output in the examples

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this convention, to have this text here in the table of contents? "* This will become a table of contents (this text will be scraped)."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This keep the same with other ML algo page.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, ok, great

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: number of image channels (no "the")

@SparkQA
Copy link

SparkQA commented Oct 23, 2018

Test build #97889 has finished for PR 22675 at commit 8231cb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@imatiach-msft
Copy link
Contributor

LGTM, this is great to have!

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from me as well

@cloud-fan
Copy link
Contributor

thanks, merging to master/2.4!

asfgit pushed a commit that referenced this pull request Oct 25, 2018
## What changes were proposed in this pull request?

Spark datasource for image/libsvm user guide

## How was this patch tested?
Scala:
<img width="1022" alt="1" src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png">

Java:
<img width="1019" alt="2" src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png">

Python:
<img width="1022" alt="3" src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png">

R:
<img width="1024" alt="4" src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png">

Closes #22675 from WeichenXu123/add_image_source_doc.

Authored-by: WeichenXu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 6540c2f)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in 6540c2f Oct 25, 2018
@WeichenXu123 WeichenXu123 deleted the add_image_source_doc branch October 25, 2018 15:12
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Spark datasource for image/libsvm user guide

## How was this patch tested?
Scala:
<img width="1022" alt="1" src="https://user-images.githubusercontent.com/19235986/47330111-a4f2e900-d6a9-11e8-9a6f-609fb8cd0f8a.png">

Java:
<img width="1019" alt="2" src="https://user-images.githubusercontent.com/19235986/47330114-a9b79d00-d6a9-11e8-97fe-c7e4b8dd5086.png">

Python:
<img width="1022" alt="3" src="https://user-images.githubusercontent.com/19235986/47330120-afad7e00-d6a9-11e8-8a0c-4340c2af727b.png">

R:
<img width="1024" alt="4" src="https://user-images.githubusercontent.com/19235986/47330126-b3410500-d6a9-11e8-9329-5e6217718edd.png">

Closes apache#22675 from WeichenXu123/add_image_source_doc.

Authored-by: WeichenXu <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants