[SPARK-4176][WIP] Support decimal types with precision > 18 in parquet #6796
Conversation
|
Note: I came across https://issues.apache.org/jira/browse/SPARK-8342 while testing; it seems like Decimal math is unsafe at the moment. |
The type of d is (Int, Int), so I think it already contains precision and scale. Why add s?
It caused a warning on my system, so I thought it would be better to make it explicit.
I could drop that line from this patch, though.
This should be correct. Scala pattern extractors return a tuple when they need to yield multiple values.
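A minimal illustration of that point (a standalone sketch with a made-up MyDecimalType, not Spark's actual DecimalType.Fixed): an unapply returning Option[(Int, Int)] lets one pattern bind both precision and scale.

// Hypothetical extractor: unapply returns Option[(Int, Int)], so a single pattern
// binds both values; no explicit tuple handling is needed at the call site.
case class MyDecimalType(precision: Int, scale: Int)
object Fixed {
  def unapply(t: MyDecimalType): Option[(Int, Int)] = Some((t.precision, t.scale))
}
MyDecimalType(30, 10) match {
  case Fixed(precision, scale) => println(s"precision=$precision scale=$scale")
}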
As I said, it was only about a warning, not about correctness. I'll drop this change in the next version; it draws too much attention and is not needed.
Force-pushed from 32630df to 8f6445c.
|
ok to test |
|
Test build #35062 has finished for PR 6796 at commit
|
Force-pushed from 8f6445c to 7310902.
|
Test build #35115 has finished for PR 6796 at commit
|
|
Thanks for working on this! Did a quick pass and it looks pretty good. I'll let @liancheng do a more complete review. Does SPARK-8342 block merging this or can you remove the WIP tag now? |
|
The only reason for the WIP is that I have not yet cross-tested the interoperability with e.g. Hive. It follows the spec, but I'd like to test it (or have someone verify this). |
|
Ah cool, it might even be a good idea to check in small files that are created with other systems for these kinds of tests. |
|
I was thinking about this: create a small parquet file with Spark, load it with Hive, copy it to a new parquet table with Hive, and read that with Spark. If the result matches the input, win; otherwise, some more work. PS: SPARK-8342 / SPARK-8359 were only problems during my initial tests. It's a bit harder to test reading/writing Decimal if the implementation has bugs. So those are unrelated to this patch, but they might reduce its usefulness (you can't do reliable math in the ranges you could now load/save). |
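For reference, a rough sketch of the Spark side of that round trip (spark-shell; the paths and the in-between Hive steps are placeholders of mine, not commands from this PR):

// Write a high-precision decimal with Spark.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("d", DecimalType(30, 10))))
val df = sqlContext.createDataFrame(
  sc.parallelize(Seq(Row(BigDecimal("12345678901234567890.0123456789")))), schema)
df.write.parquet("/tmp/spark_decimals")

// Outside Spark: declare an external Hive table over /tmp/spark_decimals, copy it
// into a new Parquet table with Hive, then read Hive's output back here and compare.
val roundTripped = sqlContext.read.parquet("/path/to/hive/output")
roundTripped.show()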
|
Are you thinking about doing this as part of the test or doing it manually? Right now parquet and its tests have no hive dependencies, which I think is good. But I would definitely like to have a test that reads a file that was written by hive/impala/etc (perhaps created manually and checked in). /cc @liancheng who has also been working on parquet interop. This could also maybe come as a follow-up PR if we want to add the interop tests in one go. |
|
Yes, manually. I could add the file I was writing afterwards, sounds like a good idea. |
|
Just did a test with Hive: I can declare a parquet file written with Spark as a Hive table and it does work. I'm now trying to test the opposite direction, plus a test case. I've also dropped the WIP tag; it doesn't make sense anymore. |
|
(Hive 1.2.0 and Hadoop 2.7.0, without HDFS or a cluster) |
|
OK, it looks like I can't open Hive-generated parquet files, but it looks more like a type error. Hmm, could be that the Spark decoder is too strict. There are various ways to encode DECIMAL(30), and it looks like Hive chooses fixed-length byte arrays, while I prefer variable-length (binary) arrays. Have to double-check that. |
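To make the two encodings concrete, here is a small sketch (the parser class lives under parquet.schema in older Parquet releases and org.apache.parquet.schema in newer ones; the schema names are made up):

import org.apache.parquet.schema.MessageTypeParser

// Hive-style: a fixed-length byte array (13 bytes are enough for 30 decimal digits).
val fixedLenSchema = MessageTypeParser.parseMessageType(
  "message hive_schema { optional fixed_len_byte_array(13) d (DECIMAL(30,0)); }")

// Variable-length alternative: a plain binary column annotated as DECIMAL.
val binarySchema = MessageTypeParser.parseMessageType(
  "message spark_schema { optional binary d (DECIMAL(30,0)); }")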
|
I've pushed the Hive-generated parquet file and I'll call it a day. I think I'll have to relax the validation of column types for DECIMAL. |
|
Test build #35317 has finished for PR 6796 at commit
|
|
OK, I think I'm slowly getting down to the cause. The relevant class for the job setup is ParquetTableScan (doExecute). I'm not yet sure whether this can be fixed during job setup or on the receiver side. |
|
The problematic line is […] and is replaced by […]. I removed that line and the loading of the data works. |
|
I've pushed a very early version of a fix. (Literally early, it's nearly 1:00 am, and I'd expect the test build to fail; I'll fix the outstanding issues later today.) PS: Loading the Hive parquet works now, but I've not tested much more yet. |
These could be
parquetSchema.filter(_.containsField(name)).map(_.getType(name))
Ah, yes, an early version had that; I somehow moved to this verbose code O.o Thanks!
It also performs a type check / conversion; that's why I removed it. It would look like this:
val parquetElementTypeBySchema =
  parquetSchema.filter(_.isInstanceOf[ParquetGroupType]).filter(_.containsField(name)).map(_.getType(name))
I would rather settle on collect; does that look OK?
val parquetElementTypeBySchema = parquetSchema.collect {
  case gType: ParquetGroupType if gType.containsField(name) => gType.getType(name)
}
This one is better
|
Test build #35373 has finished for PR 6796 at commit
|
We should use int32 / int64 if possible; see https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal (a short sketch of the rule follows the list below).
DECIMAL can be used to annotate the following types:
int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision <= 10 will produce a warning
fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.
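A sketch of that rule in Scala (the names below, e.g. physicalTypeFor and minBytesForPrecision, are mine and not the Spark converter's):

sealed trait PhysicalType
case object Int32 extends PhysicalType
case object Int64 extends PhysicalType
case class FixedLenByteArray(length: Int) extends PhysicalType

// Smallest n such that an n-byte two's-complement array can hold `precision` digits,
// i.e. floor(log10(2^(8n - 1) - 1)) >= precision, matching the spec text above.
def minBytesForPrecision(precision: Int): Int =
  Stream.from(1).find(n => math.floor((8 * n - 1) * math.log10(2)) >= precision).get

// Physical type recommended by the spec for a given decimal precision.
def physicalTypeFor(precision: Int): PhysicalType =
  if (precision <= 9) Int32
  else if (precision <= 18) Int64
  else FixedLenByteArray(minBytesForPrecision(precision))

// e.g. DECIMAL(9) -> Int32, DECIMAL(18) -> Int64, DECIMAL(30) -> FixedLenByteArray(13)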
What would we gain by encoding it that way?
We use a minimal-length fixed byte array, which should provide a similarly compact encoding. (DECIMAL(9) should end up as 4 bytes, and smaller decimal values should take even less space.)
Decoding is a different story, though.
PS: I was focusing on DECIMAL with precision >= 19. Shouldn't small decimal handling be a new ticket?
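A quick sanity check of the 4-byte claim (plain Scala, nothing Spark-specific):

import java.math.BigInteger

// The largest unscaled value of a DECIMAL(9) is 10^9 - 1; its two's-complement
// representation takes exactly 4 bytes.
val maxUnscaled = new BigInteger("999999999")
println(maxUnscaled.toByteArray.length)   // 4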
On 21 June 2015 at 02:41 (MESZ), Davies Liu [email protected] wrote:
@@ -229,11 +231,15 @@ private[parquet] object ParquetTypesConverter extends Logging {
    case LongType =>
      Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT64))
    case TimestampType =>
      Some(ParquetTypeInfo(ParquetPrimitiveTypeName.INT96))
    case DecimalType.Fixed(precision, scale) if precision <= 18 =>
      // TODO: for now, our writer only supports decimals that fit in a Long
      Some(ParquetTypeInfo(ParquetPrimitiveTypeName.FIXED_LEN_BYTE_ARRAY,
Makes sense.
Using int32 and int64 makes encoding and decoding faster since they don't introduce boxing costs. But I agree that should be done in another PR.
Force-pushed from 464d24e to f973b58.
|
Test build #35414 has finished for PR 6796 at commit
|
Force-pushed from f973b58 to 8ff6603.
|
Test build #35545 has finished for PR 6796 at commit
|
|
Hi @liancheng, thank you for the thorough review; I will push a reworked version soon. Everything sounds reasonable :-) By "private" settings I meant that I can't change the setting in the shell because it's marked as "isPublic = false" in https://github.com/liancheng/spark/blob/2a2062d3f530ecd26e75b306aee42761d67d8724/sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala#L273 and I'm not sure if that's intended. |
|
@rtreffer Yeah, it's intended. As explained above, this feature flag must be set to |
|
Hey @rtreffer, just wanted to check whether you are still working on this. I'm asking because I just opened #7231 to refactor the Parquet read path for interoperability and backwards compatibility, which also touches the decimal parts. I believe the new |
|
Hi @liancheng, I'm rebasing on your PR right now. I can work ~1-2 hours a day on this PR, so feel free to take it over if this blocks anything. |
Force-pushed from 5fe321e to e6dad45.
|
Test build #36702 has finished for PR 6796 at commit
|
Force-pushed from e6dad45 to 7a57c16.
|
The writeDecimal method is rather ugly, and the write path needs to know whether we follow the parquet-style encoding or not, as this implies a different call (addInteger / addLong). |
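Roughly the shape of the dispatch being described, as a sketch with made-up names (the real write support talks to Parquet's RecordConsumer, and the padding of fixed-length arrays is omitted here):

import org.apache.spark.sql.types.Decimal

// Minimal stand-in for the Parquet record consumer interface.
trait Consumer {
  def addInteger(v: Int): Unit
  def addLong(v: Long): Unit
  def addBinary(bytes: Array[Byte]): Unit
}

// Which add* call is made depends on the physical type chosen for the column,
// which in turn depends on the declared precision.
def writeDecimal(d: Decimal, precision: Int, consumer: Consumer): Unit = {
  if (precision <= 9) {
    consumer.addInteger(d.toUnscaledLong.toInt)   // column stored as INT32
  } else if (precision <= 18) {
    consumer.addLong(d.toUnscaledLong)            // column stored as INT64
  } else {
    // FIXED_LEN_BYTE_ARRAY / BINARY: two's-complement bytes of the unscaled value
    // (left-padding with the sign byte to the fixed length is left out in this sketch).
    consumer.addBinary(d.toJavaBigDecimal.unscaledValue.toByteArray)
  }
}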
|
Test build #36703 has finished for PR 6796 at commit
|
Force-pushed from 7a57c16 to 1152721.
|
Test build #37097 has finished for PR 6796 at commit
|
Force-pushed from 1152721 to 3e30bdf.
|
Test build #37099 has finished for PR 6796 at commit
|
Force-pushed from 3e30bdf to c8d4d6c.
|
Test build #37130 has finished for PR 6796 at commit
|
|
Test build #1055 has finished for PR 6796 at commit
|
Force-pushed from c8d4d6c to 1703c26.
|
Test build #37143 has finished for PR 6796 at commit
|
Force-pushed from 06d337a to 83ca029.
…quets fixed_byte_array
Parquet defines multiple ways to store decimals. This patch enables reading all variations as well as writing decimals in the smallest fixed-length container possible (INT32, INT64, FIXED_LEN_BYTE_ARRAY).
Force-pushed from 83ca029 to 1dad677.
|
Test build #37146 has finished for PR 6796 at commit
|
|
Test build #37147 has finished for PR 6796 at commit
|
|
@liancheng sure, I just wasn't sure if it should be closed :-) |
This PR is based on #6796 authored by rtreffer. To support large decimal precisions (> 18), we do the following things in this PR:
1. Making `CatalystSchemaConverter` support large decimal precision
   Decimal types with large precision are always converted to fixed-length byte array.
2. Making `CatalystRowConverter` support reading decimal values with large precision
   When the precision is > 18, constructs `Decimal` values with an unscaled `BigInteger` rather than an unscaled `Long`.
3. Making `RowWriteSupport` support writing decimal values with large precision
   In this PR we always write decimals as fixed-length byte array, because the Parquet write path hasn't been refactored to conform to the Parquet format spec (see SPARK-6774 & SPARK-8848).
Two follow-up tasks should be done in future PRs:
- [ ] Writing decimals as `INT32`, `INT64` when possible while fixing SPARK-8848
- [ ] Adding compatibility tests as part of SPARK-5463
Author: Cheng Lian <[email protected]>
Closes #7455 from liancheng/spark-4176 and squashes the following commits:
a543d10 [Cheng Lian] Fixes errors introduced while rebasing
9e31cdf [Cheng Lian] Supports decimals with precision > 18 for Parquet
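For item 2, a sketch of the difference on the read side (an assumed helper, not the actual CatalystRowConverter code): the bytes hold a two's-complement unscaled value, which only fits in a Long while the precision stays <= 18.

import java.math.BigInteger
import org.apache.spark.sql.types.Decimal

def toDecimal(bytes: Array[Byte], precision: Int, scale: Int): Decimal = {
  if (precision <= 18) {
    // Small precision: accumulate the two's-complement bytes into a Long.
    var unscaled = 0L
    bytes.foreach(b => unscaled = (unscaled << 8) | (b & 0xff))
    val bits = 8 * bytes.length
    unscaled = (unscaled << (64 - bits)) >> (64 - bits)   // sign-extend
    Decimal(unscaled, precision, scale)
  } else {
    // Large precision: go through BigInteger / BigDecimal instead of a Long.
    Decimal(BigDecimal(new BigInteger(bytes), scale), precision, scale)
  }
}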
This is my current WIP on SPARK-4176. It should be compatible with other implementations of parquet.
https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#decimal
This is the default encoding on bigint. It should thus be compatible with other implementations, although it would be great if someone could test this.
I've tested this locally with powers of 2 up to 2^200 in the spark shell, without errors but
Code I've used for (local) testing (on spark shell):
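The snippet itself is not captured in this excerpt; the following is only my reconstruction of what such a spark-shell test could look like (bounded to 2^126 here so the values fit DECIMAL(38, 0), whereas the original test went up to 2^200):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Powers of two as high-precision decimals, written to Parquet and read back.
val schema = StructType(Seq(StructField("d", DecimalType(38, 0))))
val values = (0 to 126).map(i => Row(BigDecimal(2).pow(i)))
val df = sqlContext.createDataFrame(sc.parallelize(values), schema)

df.write.parquet("/tmp/decimal_powers")
val back = sqlContext.read.parquet("/tmp/decimal_powers")

// Compare what was written with what comes back.
val expected = values.map(_.get(0).toString).toSet
val actual = back.collect().map(_.get(0).toString).toSet
assert(expected == actual)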