@olarayej

@olarayej olarayej commented Oct 5, 2015

Method coltypes() to get R's data types of a DataFrame

@shivaram
Contributor

shivaram commented Oct 5, 2015

Jenkins, add to whitelist

@shivaram
Contributor

shivaram commented Oct 5, 2015

Jenkins, ok to test

@SparkQA

SparkQA commented Oct 5, 2015

Test build #43260 has finished for PR 8984 at commit b44152e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

could you check for the case when it doesn't match the known types?

Author

@felixcheung Yeah, that's a good point. I'm thinking coltypes() should always have an equivalent R data type for each column. We don't want coltypes() to return NAs or throw an unsupported-type error, because that would mean the input DataFrame is inconsistent.

Therefore, it'd just be a matter of putting in DATA_TYPES the list of all possible values returned by dtypes() (in case I'm missing any). I couldn't find that list in the docs. Could you point me to it?

Finally, I think the check for unsupported data types should be done instead in the coltypes()<- method and in the DataFrame initialization. coltypes() assumes the input DataFrame was assigned valid data types, which makes sense to me.

Author

@felixcheung, @shivaram: Any thoughts on this one?

Contributor

http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types is a list that might be helpful.

Also, I think it might make sense to try to map them to R types and, if we fail to find a relevant one, fall back to the Spark SQL type.

Author

@shivaram I agree. I could use the mapping below (got the short types from schema.R:118):
scala -> R
"string"="character",
"long"="integer",
"short"="integer",
"integer"="integer",
"byte"="integer",
"double"="numeric",
"float"="numeric",
"decimal"="numeric",
"boolean"="logical"

In any other case, I will use the same Scala type. Sounds good?
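
For readers following along, the mapping-with-fallback idea proposed here can be sketched as a plain lookup. The snippet below is written in Python purely for illustration (the PR itself is R code); the names SPARK_TO_R_TYPES and to_r_types are hypothetical, and the table simply copies the pairs listed above. It is a sketch of the idea, not the code that was merged.

```python
# Hypothetical sketch of the proposed mapping. The names below are
# illustrative only and are not part of SparkR.
SPARK_TO_R_TYPES = {
    "string": "character",
    "long": "integer",
    "short": "integer",
    "integer": "integer",
    "byte": "integer",
    "double": "numeric",
    "float": "numeric",
    "decimal": "numeric",
    "boolean": "logical",
}


def to_r_types(spark_types):
    """Map each Spark SQL type name to an R type name.

    Types without an entry in the table are returned unchanged,
    which is the fallback behaviour discussed in this thread.
    """
    return [SPARK_TO_R_TYPES.get(t, t) for t in spark_types]


# "timestamp" has no entry in the table, so it falls through as-is.
print(to_r_types(["string", "double", "timestamp"]))
# prints ['character', 'numeric', 'timestamp']
```

Using a lookup with a default keeps the function total: every column gets some type name, so the caller never sees NA or an error for an unmapped type.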

Contributor

Yep. This sounds good.

@shivaram
Contributor

shivaram commented Oct 8, 2015

@olarayej Could you bring this PR up to date with the master branch?

@olarayej
Author

olarayej commented Oct 8, 2015

@shivaram Could you share the best practices to merge the changes from the master branch into the PR branch? This looks like a very common thing and the team (@NarineK, @adrian555, and myself) have tried quite a few options already, but none of them look pretty. We'd really appreciate any guidance. Thanks!

@shivaram
Contributor

shivaram commented Oct 8, 2015

There are a number of ways to do this, so this is just the way I do it personally. In my case I have two remotes in my git setup. So my .git/config looks something like

...
[remote "origin"]
        url = https://github.com/shivaram/spark-1.git
        fetch = +refs/heads/*:refs/remotes/origin/*
[remote "apache-spark"]
        url = https://github.com/apache/spark.git
        fetch = +refs/heads/*:refs/remotes/apache-spark/*
...

So if I'm on a feature branch, say SPARK-10863, I do the following:

> git fetch apache-spark master 
...
From https://github.com/apache/spark
 * branch            master     -> FETCH_HEAD
...
> git merge FETCH_HEAD
... Accept the merge commit message that shows up
> git log -2 # Optionally use this to verify if things look fine
> git push origin SPARK-10863
... This will push changes to your fork for this branch

Let me know if this works for you

@SparkQA

SparkQA commented Oct 8, 2015

Test build #43418 has finished for PR 8984 at commit 523bfbf.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@olarayej
Author

olarayej commented Oct 9, 2015

@shivaram Yes, that was helpful. Thank you! I have done the merge already. Jenkins, could you run tests?

@SparkQA

SparkQA commented Oct 9, 2015

Test build #43456 has finished for PR 8984 at commit b1afe8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

+1 on @shivaram's comment on data types above.

@SparkQA

SparkQA commented Oct 9, 2015

Test build #43487 has finished for PR 8984 at commit 76fe59a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@olarayej
Author

olarayej commented Oct 9, 2015

Thanks, @felixcheung @shivaram. I have committed my changes and tests have passed :-)

Contributor

Could you add a test with some other types? Another one that runs into the NA case and uses the SQL type would also be useful.

@SparkQA

SparkQA commented Oct 9, 2015

Test build #43495 has finished for PR 8984 at commit d53e8b3.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 9, 2015

Test build #43498 has finished for PR 8984 at commit baec23f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

shivaram commented Oct 9, 2015

Jenkins, retest this please

@SparkQA

SparkQA commented Oct 9, 2015

Test build #43499 has finished for PR 8984 at commit baec23f.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

shivaram commented Oct 9, 2015

@shaneknapp This one seems to be failing with

sbt.ResolveException: download failed: org.apache.spark#spark-unsafe_2.10;1.5.0!spark-unsafe_2.10.jar

Any idea what's up?

@olarayej
Author

olarayej commented Nov 6, 2015

@felixcheung I have tried quite a few things already but unfortunately, I haven't been able to do the rebase. Could you provide some suggestions? Thanks!

@shivaram
Contributor

shivaram commented Nov 9, 2015

@olarayej Do the git merge commands in #8984 (comment) not work?

@olarayej
Author

olarayej commented Nov 9, 2015

@shivaram @felixcheung
I followed the same steps described by @shivaram.

What's confusing for us is that every time we run a fetch followed by a merge, it triggers conflicts in a number of files that we haven't modified (even outside the R folder). After I resolved all the conflicts and ran a push, it also pushed those files. Now there are 194 modified files, which makes things pretty messy.

I'm thinking about creating a new branch and discarding this one. Thoughts?

@SparkQA

SparkQA commented Nov 9, 2015

Test build #45398 has finished for PR 8984 at commit 0bc5b35.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 9, 2015

Test build #45400 has finished for PR 8984 at commit ba091fb.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@shivaram
Contributor

shivaram commented Nov 9, 2015

Yeah something seems to be messed up. You shouldn't get other files changed if you do a fetch + merge as long as the rest of your tree is synced to the same place.

You can open a new PR if you feel that it's getting messy in this case. The only downside is that we lose all the comments we had here, but since this PR is close to being merged, it's probably fine in this case.

@olarayej
Author

I have created a new branch and PR #9579 to follow up on this.

asfgit pushed a commit that referenced this pull request Nov 10, 2015
This is a follow up on PR #8984, as the corresponding branch for such PR was damaged.

Author: Oscar D. Lara Yejas <[email protected]>

Closes #9579 from olarayej/SPARK-10863_NEW14.

(cherry picked from commit 47735cd)
Signed-off-by: Shivaram Venkataraman <[email protected]>
@shivaram
Contributor

@olarayej Could you close this PR? Only the person who opened the PR can close it, and it helps clear our PR queue at https://spark-prs.appspot.com/#r

@olarayej
Author

Closing this PR as #9579 has been created to follow up.

@olarayej olarayej closed this Nov 10, 2015