Skip to content

Conversation

@gatorsmile
Copy link
Member

When the dataset is encoded, the existing display looks strange. Using comma-separated decimal array is not common when the type is binary.

    implicit val kryoEncoder = Encoders.kryo[KryoClassData]
    val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
    ds.show(20, false);

The current output is like

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

After the fix, it will be like the below

+----------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                       |
+----------------------------------------------------------------------------------------------------------------------------+
|[01 00 6F 72 67 2E 61 70 61 63 68 65 2E 73 70 61 72 6B 2E 73 71 6C 2E 4B 72 79 6F 43 6C 61 73 73 44 61 74 E1 01 01 82 61 02]|
|[01 00 6F 72 67 2E 61 70 61 63 68 65 2E 73 70 61 72 6B 2E 73 71 6C 2E 4B 72 79 6F 43 6C 61 73 73 44 61 74 E1 01 01 82 62 04]|
|[01 00 6F 72 67 2E 61 70 61 63 68 65 2E 73 70 61 72 6B 2E 73 71 6C 2E 4B 72 79 6F 43 6C 61 73 73 44 61 74 E1 01 01 82 63 06]|
+----------------------------------------------------------------------------------------------------------------------------+

In addition, do we need to add a new method to decode and then display the contents?

@SparkQA
Copy link

SparkQA commented Dec 6, 2015

Test build #47244 has finished for PR 10165 at commit 5d0d64c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

Shoud we print the decoded values(user objects) in Dataset.show? cc @marmbrus @rxin

@gatorsmile
Copy link
Member Author

I have the exact same question when calling the show function. From the perspectives of users, they might not care the encoded values at all when calling the function show. The results of encoded values look weird to most users, I think.

@marmbrus
Copy link
Contributor

marmbrus commented Dec 7, 2015

Showing hex for binary columns seems reasonable, though we should probably truncate if its long. Showing the object representation for oparquely encoded values also seems reasonable but is probably not super easy to implement so I'd put it in a different PR if we are going to do it.

@gatorsmile
Copy link
Member Author

@marmbrus Agree.

It will truncate if we use the default value. For example,

ds.show(20);

For showing the decoded values, I can work on it.

Let me know if the current PR is ok to merge. Thanks!

@cloud-fan
Copy link
Contributor

the truncate logic is already in DataFrame.showString: if (truncate && str.length > 20) str.substring(0, 17) + "..." else str, so this LGTM.

@gatorsmile
Copy link
Member Author

Thank you! @cloud-fan

Will this PR be merged to 1.6? Or waiting for another PR for showing decoded values? @marmbrus Thank you!

@marmbrus
Copy link
Contributor

Can you add a test?

@gatorsmile
Copy link
Member Author

Sure, will do it.

Update: #10215 combines the code changes in this PR, because it has a conflict with that PR. Thus, the newly added test case is also added into that PR. Thanks!

asfgit pushed a commit that referenced this pull request Dec 16, 2015
Based on the suggestions from marmbrus cloud-fan in #10165 , this PR is to print the decoded values(user objects) in `Dataset.show`
```scala
    implicit val kryoEncoder = Encoders.kryo[KryoClassData]
    val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
    ds.show(20, false);
```
The current output is like
```
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
After the fix, it will be like the below if and only if the users override the `toString` function in the class `KryoClassData`
```scala
override def toString: String = s"KryoClassData($a, $b)"
```
```
+-------------------+
|value              |
+-------------------+
|KryoClassData(a, 1)|
|KryoClassData(b, 2)|
|KryoClassData(c, 3)|
+-------------------+
```

If users do not override the `toString` function, the results will be like
```
+---------------------------------------+
|value                                  |
+---------------------------------------+
|org.apache.spark.sql.KryoClassData68ef|
|org.apache.spark.sql.KryoClassData6915|
|org.apache.spark.sql.KryoClassData693b|
+---------------------------------------+
```

Question: Should we add another optional parameter in the function `show`? It will decide if the function `show` will display the hex values or the object values?

Author: gatorsmile <[email protected]>

Closes #10215 from gatorsmile/showDecodedValue.
@gatorsmile
Copy link
Member Author

Thank you!

@gatorsmile gatorsmile closed this Dec 16, 2015
marmbrus pushed a commit to marmbrus/spark that referenced this pull request Jan 7, 2016
Based on the suggestions from marmbrus cloud-fan in apache#10165 , this PR is to print the decoded values(user objects) in `Dataset.show`
```scala
    implicit val kryoEncoder = Encoders.kryo[KryoClassData]
    val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
    ds.show(20, false);
```
The current output is like
```
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
After the fix, it will be like the below if and only if the users override the `toString` function in the class `KryoClassData`
```scala
override def toString: String = s"KryoClassData($a, $b)"
```
```
+-------------------+
|value              |
+-------------------+
|KryoClassData(a, 1)|
|KryoClassData(b, 2)|
|KryoClassData(c, 3)|
+-------------------+
```

If users do not override the `toString` function, the results will be like
```
+---------------------------------------+
|value                                  |
+---------------------------------------+
|org.apache.spark.sql.KryoClassData68ef|
|org.apache.spark.sql.KryoClassData6915|
|org.apache.spark.sql.KryoClassData693b|
+---------------------------------------+
```

Question: Should we add another optional parameter in the function `show`? It will decide if the function `show` will display the hex values or the object values?

Author: gatorsmile <[email protected]>

Closes apache#10215 from gatorsmile/showDecodedValue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants