Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Apr 6, 2016

What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-14432

As we have the underlying implementation to calculate the approximate quantiles for multiple columns, I think we have no reason only providing API to calculate the approximate quantiles for just one column at a time. We should add API to do multiple columns too.

How was this patch tested?

Add tests to DataFrameStatSuite.

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55116 has finished for PR 12207 at commit 47d52b9.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 6, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55117 has finished for PR 12207 at commit 47d52b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 7, 2016

cc @jkbradley

StatFunctions.multipleApproxQuantiles(df, Seq(col), probabilities, relativeError).head.toArray
}

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have the full doc from the above method, we should perhaps provide an @see link to the full info about the algorithm?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Updated it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the @see link work (as in links to the method with full doc)? Can you build the docs on your PR and check it? I'm not totally sure whether it will point to the doc of the other method or just to itself.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated it with specified parameter types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will actually show up in the generated Scaladoc HTML.

@jkbradley @mengxr do you prefer to actually make links show up in the HTML API doc? If so, then it often doesn't look good in an IDE. But to do that something like this is needed:
@see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for detailed description.

@MLnick
Copy link
Contributor

MLnick commented Apr 7, 2016

Thanks @viirya - any chance to update the PySpark API at the same time? :)

@viirya
Copy link
Member Author

viirya commented Apr 7, 2016

@MLnick Thanks for review. I've updated PySpark API.

@SparkQA
Copy link

SparkQA commented Apr 7, 2016

Test build #55216 has finished for PR 12207 at commit 75edcb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 8, 2016

ping @jkbradley @MLnick any further comments for this?

self.assertEqual(len(aq), 3)
self.assertTrue(all(isinstance(q, float) for q in aq))

aqs = df.stat.approxQuantile(["a", "a"], [0.1, 0.5, 0.9], 0.1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we add an assert that len(aqs) is 2?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55349 has finished for PR 12207 at commit 619660d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55350 has finished for PR 12207 at commit b64bd4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 8, 2016

@jkbradley Can you take a look too? Thanks!

@viirya
Copy link
Member Author

viirya commented Apr 11, 2016

ping @jkbradley @MLnick

:param col: the name of the numerical column
:param cols: str, list.
The name(s) of the numerical column(s). Can be a string of the name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify this comment to: Can be a single column name, or a list of names for multiple columns. I think it's clear from the specified types that it's a string name or a list of string names.

(we mention in the method doc that it operates on numerical columns, we don't need to repeat that).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. updated.

@SparkQA
Copy link

SparkQA commented Apr 11, 2016

Test build #55512 has finished for PR 12207 at commit 89d4d3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Apr 11, 2016

LGTM, pending the discussion on the @see link. @jkbradley?

if isinstance(cols, tuple):
cols = list(cols)
if isinstance(cols, list):
cols = _to_list(self._sc, cols)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider verifying the contents of the list as done for probabilities right bellow (but just a minor point and probably not as important - just if people pass in a list of expressions rather than strings would be nice to have a useful error message).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@SparkQA
Copy link

SparkQA commented Apr 12, 2016

Test build #55568 has finished for PR 12207 at commit 4309001.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 13, 2016

ping @jkbradley @mengxr

@viirya
Copy link
Member Author

viirya commented Apr 15, 2016

Let me close this due to an earlier duplicate one.

@viirya viirya closed this Apr 15, 2016
@viirya viirya deleted the multi-cols-approxquantile branch December 27, 2023 18:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants