[SPARK-10648] Proposed bug fix when oracle returns -127 as a scale to a numeric type #8780
Conversation
So, to make sure I understand: the goal of this patch is that if an invalid value is returned (e.g., a precision or scale <= 0), then the defaults are used, yes?
Yes, that is the intention. Is this the proper way to address that issue?
I'm not entirely sure about that; one question would be whether this is reasonable behavior for all databases or only Oracle.
I'm not sure if Oracle can be associated with anything reasonable, but sometimes you have to play the hand you are dealt. :) I can only answer your question with a question: would there ever be a use case in the Decimal() class where the precision and/or the scale would be set to a negative value? If not, then I'd have to imagine that this patch makes the intention of the code more accurate across the board. If yes, then I may have to explore an Oracle-only type of patch, which may or may not ever be committed back, depending on its usefulness to the community.

I'd have to assume that there isn't a use case for negative values, given the way precision and scale are used and defined, but you'll have to forgive any ignorance on my part, as I'm still fairly new to Scala. I hadn't even browsed the source for Spark until about one week ago. I'm still in the alpha stages of even testing Spark in general, so while this has seemingly solved the problem for me in my testing, I could easily be overlooking something.
Actually, scale can be negative. It just means the number of zeros to the left of the decimal point. For example, for the number 123 with precision = 2 and scale = -1, 123 would become 120.
(I actually don't know if Spark implements this correctly -- we should test it.)
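For what it's worth, the negative-scale semantics described above can be checked quickly with java.math.BigDecimal. This is a small REPL sketch illustrating the general semantics, not Spark's own Decimal implementation:

```scala
import java.math.{BigDecimal => JBigDecimal, MathContext, RoundingMode}

// Rounding 123 to scale -1 keeps digits only down to the tens place.
val d = new JBigDecimal("123")
val tens = d.setScale(-1, RoundingMode.HALF_UP)
println(tens.toPlainString)                        // 120

// Precision = 2 means "two significant digits", which gives the same result here.
println(d.round(new MathContext(2)).toPlainString) // 120
```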
That is exactly what I was afraid of. Would the patch make more sense if it only checked precision for a zero value? Does it ever make sense to have a precision of zero (or less than zero, for that matter)? Could we safely enforce defaults if precision is zero (or less), regardless of scale? That would still solve my problem, hopefully without compromising functionality for everyone else.
That should work.
Working on a new patch... Would it ever be possible to have a case where precision is 0 (essentially undefined), but scale is still intentionally set? Or is setting the precision required in order to set a scale?
They would all be null then. It doesn't make sense to have precision < scale.
But a negative scale is inherently less than a defined precision... or do you mean that precision should never be less than the absolute value of scale? Is that something that should be tested for, and overridden with defaults if true? This would also fix my problem without DB-specific hacks.
Oh, actually, let me correct that. If scale is positive, then precision needs to be >= scale. If scale is negative, then precision can be anything (> 0). I'm not sure if precision == 0 makes any sense, since that effectively means a null value.
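A tiny predicate capturing that rule, just for illustration (isValidPrecisionScale is a hypothetical helper, not anything in Spark):

```scala
// Valid iff precision is positive and, when scale is non-negative,
// scale does not exceed precision; any negative scale is acceptable.
def isValidPrecisionScale(precision: Int, scale: Int): Boolean =
  precision > 0 && (scale < 0 || scale <= precision)

// isValidPrecisionScale(10, 2)   == true
// isValidPrecisionScale(10, -20) == true   (negative scale is fine)
// isValidPrecisionScale(0, -127) == false  (Oracle's "undefined" metadata)
// isValidPrecisionScale(5, 7)    == false  (positive scale > precision)
```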
OK, after looking at this a little further, it seems that DecimalType.bounded() should be called in JDBCRDD.scala regardless of the precision and scale values, letting the bounded() method validate them. If it finds invalid values, it can return defaults or throw an exception, depending on which condition is met. I'm going to move in that direction for this. I'm still a little confused by the precision and scale rules that you defined, though. Wouldn't it be invalid to have precision = 10 but scale = -20? Wouldn't that always result in a null value, or am I still interpreting precision and scale incorrectly?
precision = 10 and scale = -20 should be fine.
…scala, and put them into bounded function
I'm making sure the new version builds, but here are the rules:

    private[sql] def bounded(precision: Int, scale: Int): DecimalType = {
      if (precision <= 0)
        DecimalType.SYSTEM_DEFAULT
      else if (scale > precision)
        DecimalType(min(precision, MAX_PRECISION), min(precision, MAX_SCALE))
      else
        DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE))
    }

This applies for both Decimal and Numeric types in JDBCRDD.scala. Once I verify that it builds and runs, I will update the pull request. Does this look accurate?
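To make the behavior concrete, here is a small standalone sketch of the same rules with worked examples. The limits and the default pair below are placeholders standing in for Spark's MAX_PRECISION, MAX_SCALE, and SYSTEM_DEFAULT, so this is an illustration rather than the real code path:

```scala
import scala.math.min

object BoundedSketch {
  // Placeholder constants; Spark's actual values live in DecimalType.
  val MaxPrecision = 38
  val MaxScale = 38
  val SystemDefault = (38, 18)

  // Same branching as the bounded() rules quoted above, returning (precision, scale).
  def bounded(precision: Int, scale: Int): (Int, Int) =
    if (precision <= 0) SystemDefault
    else if (scale > precision) (min(precision, MaxPrecision), min(precision, MaxScale))
    else (min(precision, MaxPrecision), min(scale, MaxScale))

  def main(args: Array[String]): Unit = {
    println(bounded(0, -127)) // (38,18): Oracle's "undefined" NUMBER falls back to the default
    println(bounded(10, 2))   // (10,2):  a well-formed column keeps its declared bounds
    println(bounded(10, -20)) // (10,-20): a negative scale passes through untouched
  }
}
```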
So, any thoughts on merging this?
cc @davies to take a quick look
How about changing this to scale >= 0?
That wouldn't work, as demonstrated earlier in the thread. A negative scale is legal; it is used to reduce precision to the 10s place, 100s place, etc.
I met this problem before, and actually it's not Spark that detects them as 0 and -127, but JDBC. My solution was just to add a custom dialect. So if we want to support Oracle officially, we can add an OracleDialect.
The problem with Oracle is that you can define numbers without providing precision or scale (e.g., a column declared simply as NUMBER). I think the best solution is to handle this in an OracleDialect, since this is a quirk of Oracle. I've done that for my own code, but it would be nice to have a couple of supporting changes in Spark as well.

For now, I've handled this with my own OracleDialect.
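A rough sketch of what such a dialect can look like, written against Spark's JdbcDialect API (canHandle / getCatalystType). This is not the author's actual code; the size == 0 test and the (38, 10) fallback bounds are assumptions for illustration:

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

// When the Oracle driver reports a NUMBER column with no declared precision
// (size == 0), substitute a bounded decimal instead of the bogus (0, -127)
// metadata; otherwise return None to defer to Spark's built-in mapping.
object MyOracleDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int,
      typeName: String,
      size: Int,
      md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.NUMERIC && size == 0) {
      Some(DecimalType(38, 10)) // assumed fallback bounds; tune for your data
    } else {
      None
    }
  }
}
```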
@cloud-fan @bdolbeare @davies I'm certainly open to doing this in an Oracle-specific way if that is what is required. I was simply hoping to solve my problem while simultaneously making the whole project more robust. I completely understand if you don't believe that it's the right direction. Thanks for looking into it with me!
@travishegner Looks like it is best to just do it in the Oracle dialect.
@travishegner Will you have time to continue your work? I think our resolution is to create an Oracle dialect and automatically register it (see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala).
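For illustration, wiring in a custom dialect before a JDBC read might look like the sketch below. It assumes the MyOracleDialect object sketched earlier and a SQLContext named sqlContext in scope; the URL and table name are placeholders:

```scala
import org.apache.spark.sql.jdbc.JdbcDialects

// Make the dialect visible to Spark's JDBC data source before reading;
// it will be consulted for any URL that its canHandle method accepts.
JdbcDialects.registerDialect(MyOracleDialect)

val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service") // placeholder URL
  .option("dbtable", "SOME_SCHEMA.SOME_TABLE")              // placeholder table
  .load()
```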
Test build #1986 has finished for PR 8780.

Please see PR #9495 for the Oracle dialect solution proposed above.
This is the alternative, agreed-upon solution to PR #8780: creating an OracleDialect to handle the nonspecific numeric types that can be defined in Oracle. Author: Travis Hegner <[email protected]> Closes #9495 from travishegner/OracleDialect. (cherry picked from commit 14ee0f5) Signed-off-by: Yin Huai <[email protected]>
In my environment, the precision and scale are undefined in the Oracle database, but Spark detects them as 0 and -127 respectively (a direct JDBC metadata check is sketched below).

If I understand those two values correctly, they should never logically be defined as less than zero, so the proposed changes should correctly default the precision and scale instead of trying to use the erroneous values.

If there is a valid use case for a negative precision or scale, then I can rework this to test for the exact 0 and -127 case and handle it appropriately.
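For context, the symptom can be reproduced directly from the JDBC driver's metadata, outside of Spark. This is a sketch; the connection string, credentials, and table/column names are placeholders:

```scala
import java.sql.DriverManager

// Ask the Oracle driver what it reports for an unconstrained NUMBER column.
val conn = DriverManager.getConnection(
  "jdbc:oracle:thin:@//dbhost:1521/service", "user", "password")
try {
  val rs = conn.createStatement()
    .executeQuery("SELECT some_number_col FROM some_table WHERE 1 = 0")
  val meta = rs.getMetaData
  println(meta.getPrecision(1)) // 0    when no precision was declared
  println(meta.getScale(1))     // -127 when no scale was declared
} finally {
  conn.close()
}
```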