Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Sep 19, 2024

What changes were proposed in this pull request?

1, Make function count_min_sketch accept number arguments;
2, Make argument seed optional;
3, fix the type hints of eps/confidence/seed from ColumnOrName to Column, because they require a foldable value and actually do not accept column name:

In [3]: from pyspark.sql import functions as sf

In [4]: df = spark.range(10000).withColumn("seed", sf.lit(1).cast("int"))

In [5]: df.select(sf.hex(sf.count_min_sketch("id", sf.lit(0.5), sf.lit(0.5), "seed")))
...
AnalysisException: [DATATYPE_MISMATCH.NON_FOLDABLE_INPUT] Cannot resolve "count_min_sketch(id, 0.5, 0.5, seed)" due to data type mismatch: the input `seed` should be a foldable "INT" expression; however, got "seed". SQLSTATE: 42K09;
'Aggregate [unresolvedalias('hex(count_min_sketch(id#1L, 0.5, 0.5, seed#2, 0, 0)))]
+- Project [id#1L, cast(1 as int) AS seed#2]
   +- Range (0, 10000, step=1, splits=Some(12))
...

Why are the changes needed?

1, seed is optional in other similar functions;
2, existing type hint is ColumnOrName which is misleading since column name is not actually supported

Does this PR introduce any user-facing change?

yes, it support number arguments

How was this patch tested?

updated doctests

Was this patch authored or co-authored using generative AI tooling?

no

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would need a connect version as well.

Copy link
Contributor Author

@zhengruifeng zhengruifeng Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! let me add one for Scala client.

Copy link
Contributor Author

@zhengruifeng zhengruifeng Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon it seems we don't have a separate functions.scala for Scala client now, after a lot of refactoring?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: , optional in the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!

@zhengruifeng zhengruifeng force-pushed the py_fix_count_min_sketch branch from 5164098 to 078792f Compare September 19, 2024 08:37
@zhengruifeng zhengruifeng force-pushed the py_fix_count_min_sketch branch from 56f5793 to c08d9d9 Compare September 19, 2024 10:53
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@zhengruifeng
Copy link
Contributor Author

thanks @dongjoon-hyun @xinrong-meng and @HyukjinKwon

merged to master

@zhengruifeng zhengruifeng deleted the py_fix_count_min_sketch branch September 20, 2024 00:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants