Conversation

@AdrianOlosutean AdrianOlosutean commented Jan 13, 2022

Closes #22

@AdrianOlosutean AdrianOlosutean changed the title from Feature/21 enceladus schema utils to Feature/22 enceladus schema utils Jan 13, 2022
@AdrianOlosutean AdrianOlosutean changed the base branch from master to feature/19-add-columnimplicits January 13, 2022 14:24
…us-schema-utils

# Conflicts:
#	src/test/scala/za/co/absa/spark/commons/implicits/StructFieldImplicitsTest.scala
@AdrianOlosutean AdrianOlosutean marked this pull request as ready for review January 20, 2022 11:19

@dk1844 dk1844 left a comment

It's a little confusing that the feature branch is marked as feature/21 while the PR topic is feature/22.

Aside from the comment I have made, LGTM. (read, checked out, ran some integration tests)

```scala
val maxSpark2XVersionExcluded: SemanticVersion = semver"3.0.0"

val minSpark3XVersionIncluded: SemanticVersion = semver"3.0.0"
val maxSpark3XVersionExcluded: SemanticVersion = semver"4.0.0"
```
Contributor

This seems tricky. When this was in Enceladus, we could set the highest version that we had tried and adjust the max version as new versions got released and tested.

Now that it has been moved to commons, this logic no longer applies, because the check can be used elsewhere with different compatibility requirements. This leads me to believe that in this general case the method fromSpark3XCompatibilitySettings could perhaps only enforce the minimal Spark version. Can be discussed.
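
To make the suggestion concrete, here is a minimal, self-contained sketch of a minimum-only compatibility check. `SimpleVersion`, `Spark3Compatibility` and `isCompatible` are hypothetical names for illustration only, not the PR's actual API:

```scala
// Hypothetical sketch: enforce only a lower bound on the Spark version,
// leaving any upper bound to the caller's own compatibility requirements.
final case class SimpleVersion(major: Int, minor: Int, patch: Int) extends Ordered[SimpleVersion] {
  def compare(that: SimpleVersion): Int =
    Ordering[(Int, Int, Int)].compare((major, minor, patch), (that.major, that.minor, that.patch))
}

object Spark3Compatibility {
  val minSpark3XVersionIncluded: SimpleVersion = SimpleVersion(3, 0, 0)

  // No maximum tested version is assumed here.
  def isCompatible(current: SimpleVersion): Boolean =
    current >= minSpark3XVersionIncluded
}

Spark3Compatibility.isCompatible(SimpleVersion(3, 2, 1)) // true
```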

@AdrianOlosutean
Contributor Author

> It's a little confusing that the feature branch is marked as feature/21 while the PR topic is feature/22.

True, that was a mistake when naming the branch.

@AdrianOlosutean AdrianOlosutean changed the title from Feature/22 enceladus schema utils to 22 enceladus schema utils Jan 25, 2022
…us-schema-utils

# Conflicts:
#	README.md
#	src/main/scala/za/co/absa/spark/commons/schema/SchemaUtils.scala
#	src/test/scala/za/co/absa/spark/commons/implicits/StructFieldImplicitsTest.scala
dk1844 previously approved these changes Jan 25, 2022

@dk1844 dk1844 left a comment

LGTM now (just read the code this time)

Zejnilovic previously approved these changes Jan 28, 2022

@Zejnilovic Zejnilovic left a comment

Except for one comment on the README, looks good.

Comment on lines 82 to 91

1. Determine the name of a field overridden by metadata

```scala
structField.getFieldNameOverriddenByMetadata()
```

Of them, metadata methods are:

1. Gets the metadata Option[String] value given a key
Contributor

this seems off
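
For context on that list item, a minimal sketch of what a key-based `Option[String]` metadata getter can amount to, written against plain Spark API; `metadataStringOpt` is a hypothetical name, not the implicit introduced by this PR:

```scala
import scala.util.Try
import org.apache.spark.sql.types.StructField

// Reads a metadata value as Option[String]; None when the key is absent
// or the stored value is not a string.
def metadataStringOpt(field: StructField, key: String): Option[String] =
  Try(field.metadata.getString(key)).toOption
```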


@benedeki benedeki left a comment

Sometimes the tests have the suffix Test, sometimes Suite. It would look better to be consistent.

```scala
*
* @return Metadata "sourcecolumn" if it exists or field.name
*/
def getFieldNameOverriddenByMetadata(): String = {
```
Contributor

Questioning its SparkCommons status.
(There might be more of these which, I think, are too Enceladus-specific to be placed in SparkCommons. Happy to discuss their status.)
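
Based on the scaladoc quoted above, a sketch of the documented behaviour (prefer the `sourcecolumn` metadata entry, otherwise fall back to the field's own name). The body is an assumption drawn from the doc comment, not necessarily the PR's exact code:

```scala
import org.apache.spark.sql.types.StructField

// Documented behaviour: metadata "sourcecolumn" if present, else field.name.
def fieldNameOverriddenByMetadata(field: StructField): String =
  if (field.metadata.contains("sourcecolumn")) field.metadata.getString("sourcecolumn")
  else field.name
```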

```scala
* @return the keys of the returned map are the columns' names after renames, the values are the source columns;
*         the names are full paths denoted with dot notation
*/
def getRenamesInSchema(includeIfPredecessorChanged: Boolean = true): Map[String, String] = {
```
Contributor

Questioning its SparkCommons status.

```scala
* @param path The path to the attribute
* @return The path of the first array field or "" if none were found
*/
def getFirstArrayPath(path: String): String = {
```
Contributor

Questioning its SparkCommons status.

```scala
* @param path The path to the attribute
* @return Seq of dot-separated paths for all array fields in the provided path
*/
def getAllArraysInPath(path: String): Seq[String] = {
```
Contributor

Questioning its SparkCommons status.

Contributor Author

Moving
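
For reference, a simplified, self-contained sketch of the behaviour described by the scaladoc above: walk the dot-separated path and record every prefix whose field is an array. It deliberately ignores arrays of arrays and is an illustration, not the PR's implementation:

```scala
import org.apache.spark.sql.types.{ArrayType, StructType}

// Collect the dot-separated prefixes of `path` that point at array fields.
def allArraysInPath(schema: StructType, path: String): Seq[String] = {
  def loop(struct: StructType, remaining: List[String], prefix: List[String], acc: List[String]): List[String] =
    remaining match {
      case Nil => acc.reverse
      case name :: rest =>
        struct.fields.find(_.name == name) match {
          case None => acc.reverse
          case Some(field) =>
            val fullName = (prefix :+ name).mkString(".")
            field.dataType match {
              case ArrayType(elem: StructType, _) => loop(elem, rest, prefix :+ name, fullName :: acc)
              case nested: StructType             => loop(nested, rest, prefix :+ name, acc)
              case _: ArrayType                   => (fullName :: acc).reverse
              case _                              => acc.reverse
            }
        }
    }
  loop(schema, path.split('.').toList, Nil, Nil)
}
```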

```scala
* @param fieldPathName A field to check
* @return true if the specified field is an array
*/
def isArray(fieldPathName: String): Boolean = {
```
Contributor

Couldn't this be effectively replaced with getFieldType?
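
To illustrate that suggestion: assuming some path-based lookup such as `getFieldType(path): Option[DataType]` were available (the signature here is a guess), `isArray` would reduce to a type check on its result:

```scala
import org.apache.spark.sql.types.{ArrayType, DataType}

// Hypothetical: getFieldType resolves a dot-separated path to its DataType, if any.
def isArray(fieldPathName: String, getFieldType: String => Option[DataType]): Boolean =
  getFieldType(fieldPathName).exists(_.isInstanceOf[ArrayType])
```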

```scala
* @param path the fully qualified field name
* @return unique top level field name
*/
def unpath(path: String): String = {
```
Contributor

Questioning its SparkCommons status.
This is a really weird function. I mean, it works in our cases, but it's rather utilitarian, not very conceptual, and has a cryptic name.

```scala
* @return A non-array data type at the bottom of array nesting
*/
@tailrec
final def getDeepestArrayType(arrayType: ArrayType): DataType = {
```
Contributor

Questioning its SparkCommons status.

Contributor Author

Add implicits for ArrayType
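
For reference, a self-contained sketch of the behaviour described by the scaladoc (unwrap nested arrays until a non-array element type is reached); it mirrors the documentation rather than the exact implementation:

```scala
import scala.annotation.tailrec
import org.apache.spark.sql.types.{ArrayType, DataType}

// Keep unwrapping ArrayType(ArrayType(...)) until the element type is not an array.
@tailrec
def deepestArrayType(arrayType: ArrayType): DataType = arrayType.elementType match {
  case nested: ArrayType => deepestArrayType(nested)
  case other             => other
}
```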

@AdrianOlosutean AdrianOlosutean changed the base branch from feature/19-add-columnimplicits to master February 8, 2022 06:59
@AdrianOlosutean AdrianOlosutean dismissed stale reviews from Zejnilovic and dk1844 February 8, 2022 06:59

The base branch was changed.

…s-schema-utils

# Conflicts:
#	README.md
#	src/main/scala/za/co/absa/spark/commons/implicits/StructFieldImplicits.scala
#	src/test/scala/za/co/absa/spark/commons/implicits/ColumnImplicitsTest.scala
#	src/test/scala/za/co/absa/spark/commons/implicits/StructFieldImplicitsTest.scala
dk1844 previously approved these changes Feb 10, 2022

@dk1844 dk1844 left a comment

Otherwise LGTM (just read the code)

README.md Outdated
Comment on lines 144 to 148
1. Get a field from a text path

```scala
arrayType.isEquivalentArrayType(otherArrayType)
```
Contributor

Here, the description does not seem to match the method.

README.md Outdated
Comment on lines 161 to 165
1. Get a field from a text path

```scala
dataType.isEquivalentDataType(otherDt)
```
Contributor

Same here.

Comment on lines +183 to +187
1. Get a field from a text path

```scala
structType.getField(path)
```
Contributor

This is probably the source where it made sense.
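
For readers of that README item, a simplified sketch of what "get a field from a text path" typically means on a `StructType` (dot-separated traversal of nested structs); this version skips array handling and is an illustration, not the PR's code:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Walk a dot-separated path ("a.b.c") through nested structs; None if any segment is missing.
def getField(schema: StructType, path: String): Option[StructField] = {
  def loop(struct: StructType, segments: List[String]): Option[StructField] =
    segments match {
      case Nil => None
      case name :: rest =>
        struct.fields.find(_.name == name).flatMap { field =>
          if (rest.isEmpty) Some(field)
          else field.dataType match {
            case nested: StructType => loop(nested, rest)
            case _                  => None
          }
        }
    }
  loop(schema, path.split('.').toList)
}
```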

```scala
* @return true if provided arrays are the same ignoring nullability
*/
@scala.annotation.tailrec
final def isEquivalentArrayType(other: ArrayType): Boolean = {
```
Contributor

This seems very similar to DataTypeEnhancements(dt: DataType).isEquivalentDataType(). Couldn't one of these methods use the other instead of containing the same logic?
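
To make that suggestion concrete, a hedged sketch of how the array-specific check could delegate to a general data-type equivalence that ignores nullability; both bodies are illustrative assumptions, not the PR's code:

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

// Structural equivalence that ignores nullability and containsNull flags.
def isEquivalentDataType(dt: DataType, other: DataType): Boolean = (dt, other) match {
  case (a: ArrayType, b: ArrayType)   => isEquivalentDataType(a.elementType, b.elementType)
  case (a: StructType, b: StructType) =>
    a.fields.length == b.fields.length &&
      a.fields.zip(b.fields).forall { case (f, g) =>
        f.name == g.name && isEquivalentDataType(f.dataType, g.dataType)
      }
  case (a, b) => a == b
}

// The array-specific variant then becomes a thin wrapper over the general one.
def isEquivalentArrayType(arrayType: ArrayType, other: ArrayType): Boolean =
  isEquivalentDataType(arrayType, other)
```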

Contributor

Let's fix only the obvious errors now. We can improve and add items in the next release - a minor one.
As Adrian rightly pointed out, endless improvements prevent release, and therefore usage.

Contributor

Fine by me, makes sense.

…s-schema-utils

# Conflicts:
#	src/test/scala/za/co/absa/spark/commons/implicits/DataFrameImplicitsTest.scala
#	src/test/scala/za/co/absa/spark/commons/schema/SchemaUtilsSpec.scala

@benedeki benedeki left a comment

Another piece of good work 👍

@AdrianOlosutean AdrianOlosutean merged commit e6289b8 into master Feb 11, 2022
@AdrianOlosutean AdrianOlosutean deleted the feature/21-enceladus-schema-utils branch February 11, 2022 06:45

Development

Successfully merging this pull request may close these issues.

Add Enceladus utils SchemaUtils
