[SPARK-10863][SPARKR] Method coltypes() to get R's data types of a DataFrame #8984
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -34,4 +34,5 @@ Collate: | |
| 'serialize.R' | ||
| 'sparkR.R' | ||
| 'stats.R' | ||
| 'types.R' | ||
| 'utils.R' | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2152,3 +2152,58 @@ setMethod("with", | |
| newEnv <- assignNewEnv(data) | ||
| eval(substitute(expr), envir = newEnv, enclos = newEnv) | ||
| }) | ||
|
|
||
| #' Returns the column types of a DataFrame. | ||
| #' | ||
| #' @name coltypes | ||
| #' @title Get column types of a DataFrame | ||
| #' @param x (DataFrame) | ||
| #' @return value (character) A character vector with the column types of the given DataFrame | ||
| #' @rdname coltypes | ||
| setMethod("coltypes", | ||
| signature(x = "DataFrame"), | ||
| function(x) { | ||
| # TODO: This may be moved as a global parameter | ||
| # These are the supported data types and how they map to | ||
| # R's data types | ||
| DATA_TYPES <- c("string"="character", | ||
| "long"="integer", | ||
| "tinyint"="integer", | ||
| "short"="integer", | ||
| "integer"="integer", | ||
| "byte"="integer", | ||
| "double"="numeric", | ||
| "float"="numeric", | ||
| "decimal"="numeric", | ||
| "boolean"="logical" | ||
| ) | ||
|
Contributor
You only handle primitive types here, not complex types such as Array, Struct, and Map. It would be better to refactor the type-mapping code here together with the corresponding code in SerDe.
Author
@sun-rui For complex types (Array/Struct/Map), I can't think of any mapping to R types. Therefore, as agreed with @felixcheung and @shivaram, these will remain the same. For example: Original column types: ["string", "boolean", "map..."]
Contributor
@olarayej I think the fallback mechanism here is good. But @sun-rui makes another good point: it would be good to have one unified place where we map R types to Java types. Right now, part of that lives in serialize.R / deserialize.R. Could you see whether some refactoring would keep this from being duplicated?
Author
@sun-rui @shivaram In serialize.R, the method writeType (see below) turns the full data type into a one-character string. The method readTypedObject (see below) then uses this one-character type to read the value accordingly. I suspect this is because complex types could be like map (String,String)?
In my opinion, it would be better to use the full data type rather than its first letter, which can be especially confusing because we support data types that start with the same letter (Date/Double, String/Struct). Having the full data type would also allow the data types to be centralized in one place, though this would require some major changes.
We could have mapping arrays:
PRIMITIVE_TYPES <- c("string"="character",
COMPLEX_TYPES <- c("map", "array", "struct", ...)
DATA_TYPES <- c(PRIMITIVE_TYPES, COMPLEX_TYPES)
We would then need to modify deserialize.R, serialize.R, and schema.R to acknowledge these accordingly. Thoughts?
writeType <- function(con, class) {
readTypedObject <- function(con, type) {
Contributor
The single-character names are there to reduce the amount of data serialized when we transfer these data types to the JVM. They are not meant to be remembered by anybody, so I don't see them being a source of confusion. @sun-rui also added tests which ensure these mappings don't break. However, having a list of primitive types, complex types, and their mapping in a common file (types.R?) sounds good to me. |
||
|
|
||
| # Get the data types of the DataFrame by invoking dtypes() function | ||
| types <- sapply(dtypes(x), function(x) {x[[2]]}) | ||
|
|
||
| # Map Spark data types into R's data types using the PRIMITIVE_TYPES and COMPLEX_TYPES lookups | ||
| rTypes <- sapply(types, USE.NAMES=F, FUN=function(x) { | ||
|
|
||
| # Check for primitive types | ||
| type <- PRIMITIVE_TYPES[[x]] | ||
| if (is.null(type)) { | ||
| # Check for complex types | ||
| typeName <- Filter(function(t) { substring(x, 1, nchar(t)) == t}, | ||
| names(COMPLEX_TYPES)) | ||
| if (length(typeName) > 0) { | ||
| type <- COMPLEX_TYPES[[typeName]] | ||
| } else { | ||
| stop(paste("Unsupported data type:", x)) | ||
| } | ||
| } | ||
| type | ||
| }) | ||
|
|
||
| # Find which types don't have mapping to R | ||
| naIndices <- which(is.na(rTypes)) | ||
|
|
||
| # Assign the original Scala data types to the unmatched ones | ||
| rTypes[naIndices] <- types[naIndices] | ||
|
|
||
| rTypes | ||
| }) | ||
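The lookup logic in this diff can be tried outside of Spark. The following is a minimal sketch, not the PR's code: the names `primitiveTypes`, `complexTypes`, `lookupType`, and `mapTypes` are hypothetical, and the tables contain only a few entries. It shows the three outcomes discussed in the review: a primitive type maps to an R type, a complex type falls back to its original Spark name, and anything else raises an error.

```r
# Hypothetical lookup tables, for illustration only (not SparkR's actual tables).
primitiveTypes <- as.environment(list("string" = "character",
                                      "double" = "numeric",
                                      "boolean" = "logical"))
complexTypes <- c("map", "array", "struct")

# Return the R type for one Spark type name, or NA for a complex type.
lookupType <- function(sparkType) {
  rType <- primitiveTypes[[sparkType]]  # NULL when the name is absent
  if (!is.null(rType)) {
    return(rType)
  }
  # Complex types carry parameters ("map<string,string>"), so match by prefix.
  isComplex <- substring(sparkType, 1, nchar(complexTypes)) == complexTypes
  if (any(isComplex)) {
    return(NA_character_)
  }
  stop("Unsupported data type: ", sparkType)
}

# Map a vector of Spark type names, keeping the Spark name
# wherever R has no equivalent type.
mapTypes <- function(sparkTypes) {
  rTypes <- vapply(sparkTypes, lookupType, character(1), USE.NAMES = FALSE)
  unmatched <- is.na(rTypes)
  rTypes[unmatched] <- sparkTypes[unmatched]
  rTypes
}

mapTypes(c("string", "boolean", "map<string,string>"))
```

Using `vapply` with `character(1)` rather than `sapply` is a design choice of this sketch: it guarantees a character vector even for empty input.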
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| # Licensed to the Apache Software Foundation (ASF) under one or more | ||
| # contributor license agreements. See the NOTICE file distributed with | ||
| # this work for additional information regarding copyright ownership. | ||
| # The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| # (the "License"); you may not use this file except in compliance with | ||
| # the License. You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| # | ||
| # types.R. This file handles the data type mapping between Spark and R | ||
|
|
||
| # The primitive data types, where names(PRIMITIVE_TYPES) are Scala types whereas | ||
| # values are equivalent R types. This is stored in an environment to allow for | ||
| # more efficient look up (environments use hashmaps). | ||
| PRIMITIVE_TYPES <- as.environment(list( | ||
| "byte"="integer", | ||
|
Contributor
"byte" should be "tinyint" |
||
| "tinyint"="integer", | ||
|
Contributor
"smallint"="integer", |
||
| "smallint"="integer", | ||
| "integer"="integer", | ||
|
Contributor
"int"="integer",
Contributor
"bigint"="numeric" |
||
| "bigint"="numeric", | ||
| "float"="numeric", | ||
| "double"="numeric", | ||
| "decimal"="numeric", | ||
| "string"="character", | ||
| "binary"="raw", | ||
| "boolean"="logical", | ||
| "timestamp"="POSIXct", | ||
| "date"="Date")) | ||
|
|
||
| # The complex data types. These do not have any direct mapping to R's types. | ||
| COMPLEX_TYPES <- list( | ||
| "map"=NA, | ||
| "array"=NA, | ||
| "struct"=NA) | ||
|
|
||
| # The full list of data types. | ||
| DATA_TYPES <- as.environment(c(as.list(PRIMITIVE_TYPES), COMPLEX_TYPES)) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -695,13 +695,6 @@ test_that("select with column", { | |
| expect_equal(columns(df3), c("x")) | ||
| expect_equal(count(df3), 3) | ||
| expect_equal(collect(select(df3, "x"))[[1, 1]], "x") | ||
|
|
||
| df4 <- select(df, c("name", "age")) | ||
| expect_equal(columns(df4), c("name", "age")) | ||
| expect_equal(count(df4), 3) | ||
|
|
||
| expect_error(select(df, c("name", "age"), "name"), | ||
| "To select multiple columns, use a character vector or list for col") | ||
| }) | ||
|
|
||
| test_that("subsetting", { | ||
|
|
@@ -1467,8 +1460,9 @@ test_that("SQL error message is returned from JVM", { | |
| expect_equal(grepl("Table not found: blah", retError), TRUE) | ||
| }) | ||
|
|
||
| irisDF <- createDataFrame(sqlContext, iris) | ||
|
|
||
| test_that("Method as.data.frame as a synonym for collect()", { | ||
| irisDF <- createDataFrame(sqlContext, iris) | ||
| expect_equal(as.data.frame(irisDF), collect(irisDF)) | ||
| irisDF2 <- irisDF[irisDF$Species == "setosa", ] | ||
| expect_equal(as.data.frame(irisDF2), collect(irisDF2)) | ||
|
|
@@ -1503,6 +1497,27 @@ test_that("with() on a DataFrame", { | |
| expect_equal(nrow(sum2), 35) | ||
| }) | ||
|
|
||
| test_that("Method coltypes() to get R's data types of a DataFrame", { | ||
| expect_equal(coltypes(irisDF), c(rep("numeric", 4), "character")) | ||
|
|
||
| data <- data.frame(c1=c(1,2,3), | ||
| c2=c(T,F,T), | ||
| c3=c("2015/01/01 10:00:00", "2015/01/02 10:00:00", "2015/01/03 10:00:00")) | ||
|
|
||
| schema <- structType(structField("c1", "byte"), | ||
| structField("c3", "boolean"), | ||
| structField("c4", "timestamp")) | ||
|
|
||
| # Test primitive types | ||
| DF <- createDataFrame(sqlContext, data, schema) | ||
| expect_equal(coltypes(DF), c("integer", "logical", "POSIXct")) | ||
|
|
||
| # Test complex types | ||
| x <- createDataFrame(sqlContext, list(list(as.environment( | ||
| list("a"="b", "c"="d", "e"="f"))))) | ||
| expect_equal(coltypes(x), "map<string,string>") | ||
| }) | ||
|
Contributor
Could you add a test with some other types? Also another one which runs into the |
||
|
|
||
| unlink(parquetPath) | ||
| unlink(jsonPath) | ||
| unlink(jsonPathNa) | ||
Could you update the style of function description to be more consistent with other existing ones?
I can change this when updating my PR #9218