-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-10863][SPARKR] Method coltypes() to get R's data types of a DataFrame #8984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Changed setMethod stle Removed return()
Conflicts: R/pkg/R/DataFrame.R
# Conflicts: # R/pkg/R/DataFrame.R
|
Jenkins, add to whitelist |
|
Jenkins, ok to test |
|
Test build #43260 has finished for PR 8984 at commit
|
R/pkg/R/DataFrame.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you check for the case when it doesn't match the known types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixcheung Yeah, that's a good point. I'm thinking coltypes() should always have an equivalent R data type for each column. We don't want method coltypes() to return NA's or throw an unsupported-type error cuz that would mean that the input DataFrame is inconsistent.
Therefore, it'd be just a matter of putting in DATA_TYPES, the list all possible values returned by dtypes() (If I'm missing any). I couldn't find that in the docs. Could you point me to the list?
Finally, I think the check for unsupported data types should be done instead in the coltypes()<- method and in the DataFrame initialization. coltypes() assumes the input DataFrame was assigned valid data types, which makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixcheung, @shivaram: Any thoughts on this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types is a list that might be helpful.
Also I think it might make sense to try and map them to R types and if we fail to find a relevant one we fallback to the SparkSQL type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shivaram I agree. I could use the mapping below (got the short types from schema.R:118):
scala -> R
"string"="character",
"long"="integer",
"short"="integer",
"integer"="integer"
"byte"="integer",
"double"="numeric",
"float"="numeric",
"decimal"="numeric",
"boolean"="logical"
In any other case, I will use the same scala type. Sounds good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. This sounds good.
|
@olarayej Could you bring this PR up to date with master branch ? |
|
@shivaram Could you share the best practices to merge the changes from the master branch into the PR branch? This looks like a very common thing and the team (@NarineK, @adrian555, and myself) have tried quite a few options already, but none of them look pretty. We'd really appreciate any guidance. Thanks! |
|
There are a number of ways to do this, so this is just the way I do it personally. In my case I have two remotes in my git setup. So my .git/config looks something like So if I'm on a feature branch say Let me know if this works for you |
|
Test build #43418 has finished for PR 8984 at commit
|
|
@shivaram Yes, that was helpful. Thank you! I have done the merge already. Jenkins, could you run tests? |
|
Test build #43456 has finished for PR 8984 at commit
|
|
+1 on @shivaram comment on data-type above. |
|
Test build #43487 has finished for PR 8984 at commit
|
|
Thanks, @felixcheung @shivaram. I have committed my changes and tests have passed :-) |
R/pkg/inst/tests/test_sparkSQL.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test with some other types ? Also another one which runs into the NA case and uses the SQL type would be useful.
|
Test build #43495 has finished for PR 8984 at commit
|
|
Test build #43498 has finished for PR 8984 at commit
|
|
Jenkins, retest this please |
|
Test build #43499 has finished for PR 8984 at commit
|
|
@shaneknapp This one seems to be failing with Any idea whats up ? |
|
@felixcheung I have tried quite a few things already but unfortunately, I haven't been able to do the rebase. Could you provide some suggestions? Thanks! |
|
@olarayej Do the git merge commands in #8984 (comment) not work ? |
|
@shivaram @felixcheung What's confusing for us is that every time we run a fetch followed by a merge, it triggers conflicts with a number of files that we haven't modified (even outside the R folder). After I solved all conflicts, and ran a push, it also pushed those files. Now there are 194 modified files, which makes things pretty messy. I'm thinking about creating a new branch and discard this one. Thoughts? |
|
Test build #45398 has finished for PR 8984 at commit
|
0bc5b35 to
ba091fb
Compare
|
Test build #45400 has finished for PR 8984 at commit
|
|
Yeah something seems to be messed up. You shouldn't get other files changed if you do a fetch + merge as long as the rest of your tree is synced to the same place. You can open a new PR if you feel that its getting messy in this case -- The only downside is that we lose all these comments we had etc. but since this PR is close to being merged its probably fine in this case. |
|
I have created a new branch and PR #9579 to follow up on this. |
This is a follow up on PR #8984, as the corresponding branch for such PR was damaged. Author: Oscar D. Lara Yejas <[email protected]> Closes #9579 from olarayej/SPARK-10863_NEW14. (cherry picked from commit 47735cd) Signed-off-by: Shivaram Venkataraman <[email protected]>
This is a follow up on PR #8984, as the corresponding branch for such PR was damaged. Author: Oscar D. Lara Yejas <[email protected]> Closes #9579 from olarayej/SPARK-10863_NEW14.
|
@olarayej Could you close this PR ? Only the person who opened the PR can close it and it helps clear our PR queue at https://spark-prs.appspot.com/#r |
|
Closing this PR as #9579 has been created to follow up.... |
Method coltypes() to get R's data types of a DataFrame