[SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator #14174
Closed
Conversation
Test build #62222 has finished for PR 14174 at commit
- Hard-coded Spark SQL sample snippets were moved into source files under the examples sub-project.
- Removed the inconsistency between Scala and Java Spark SQL examples.
- Scala and Java Spark SQL examples were updated.

The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.

Author: aokolnychyi <[email protected]>

Closes apache#14119 from aokolnychyi/spark_16303.
Contributor
Author
…amTest

## What changes were proposed in this pull request?
A second form of AssertQuery now actually invokes the condition; avoids a build warning too.

## How was this patch tested?
Jenkins; running StreamTest.

Author: Sean Owen <[email protected]>

Closes apache#14133 from srowen/SPARK-15889.2.
…signed to numSkippedTasks

## What changes were proposed in this pull request?
I fixed a misassigned var: numCompletedTasks was assigned to numSkippedTasks in the convertJobData method.

## How was this patch tested?
dev/run-tests

Author: Alex Bozarth <[email protected]>

Closes apache#14141 from ajbozarth/spark16375.
## What changes were proposed in this pull request?
Fix Java style errors and remove unused imports, found at random across the code.

## How was this patch tested?
Tested on my local machine.

Author: Xin Ren <[email protected]>

Closes apache#14161 from keypointt/SPARK-16437.
## What changes were proposed in this pull request?
The Spark SQL UI displays numbers greater than 1000 with U+00A0 as the grouping separator. The problem exists when the server locale uses a non-breaking space as the separator (for example pl_PL). This patch turns off grouping and removes this separator. The problem started with this PR: https://github.com/apache/spark/pull/12425/files#diff-803f475b01acfae1c5c96807c2ea9ddcR125

## How was this patch tested?
Manual UI tests. Screenshot attached.

Author: Maciej Brynski <[email protected]>

Closes apache#14142 from maver1ck/master.
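A minimal sketch of the approach that commit message describes, assuming the fix amounts to disabling locale-specific grouping when formatting counts (the helper below is hypothetical, not the actual UI code):

```scala
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper: format a metric count without the locale's grouping
// separator, so a pl_PL server never renders U+00A0 between digit groups.
def formatCount(n: Long, locale: Locale): String = {
  val nf = NumberFormat.getIntegerInstance(locale)
  nf.setGroupingUsed(false) // the essence of the fix described above
  nf.format(n)
}

// formatCount(1234567L, new Locale("pl", "PL")) returns "1234567", not "1 234 567"
```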
## What changes were proposed in this pull request?
Add asynchronous actions documentation inside the actions section of the programming guide.

## How was this patch tested?
Checked the documentation indentation and formatting with Markdown preview.

Author: sandy <[email protected]>

Closes apache#14104 from phalodi/SPARK-16438.
## What changes were proposed in this pull request?
Fix a code error in the yarn-cluster unit test.

## How was this patch tested?
Used existing tests.

Author: sharkd <[email protected]>

Closes apache#14166 from sharkdtu/master.
…g OoM for long runs

## What changes were proposed in this pull request?
Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup.

## How was this patch tested?
Jenkins tests.

Author: Sean Owen <[email protected]>

Closes apache#14153 from srowen/SPARK-16440.
Test build #62230 has finished for PR 14174 at commit
## What changes were proposed in this pull request?
We have a use case of multiplying very big sparse matrices: roughly 1000x1000 distributed block matrix multiplications, where the simulated multiply scales like O(n^4) (n being 1000) and takes about 1.5 hours. We modified it slightly with a classical hashmap and it now runs in about 30 seconds, O(n^2).

## How was this patch tested?
We have added a performance test and verified the reduced time.

Author: oraviv <[email protected]>

Closes apache#14068 from uzadude/master.
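A rough sketch of the kind of change described above, using hypothetical names rather than the actual BlockMatrix.simulateMultiply internals: the right-hand block coordinates are indexed in a hash map once, so pairing blocks no longer rescans the full list for every left-hand block.

```scala
import scala.collection.mutable

// Hypothetical sketch of pairing multiplicable blocks via a hash map lookup.
// leftBlocks/rightBlocks hold (rowIndex, colIndex) coordinates of non-empty blocks.
def pairBlocks(
    leftBlocks: Seq[(Int, Int)],
    rightBlocks: Seq[(Int, Int)]): Seq[((Int, Int), (Int, Int))] = {
  // Group right-hand blocks by row index once, instead of re-filtering the whole
  // list for every left-hand block.
  val rightByRow = new mutable.HashMap[Int, mutable.ArrayBuffer[(Int, Int)]]()
  rightBlocks.foreach { case (r, c) =>
    rightByRow.getOrElseUpdate(r, mutable.ArrayBuffer.empty) += ((r, c))
  }
  for {
    (lr, lc) <- leftBlocks
    right    <- rightByRow.getOrElse(lc, mutable.ArrayBuffer.empty)
  } yield ((lr, lc), right)
}
```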
…dicates correctly in non-deterministic condition.

## What changes were proposed in this pull request?
Currently our Optimizer may reorder predicates to run them more efficiently, but with a non-deterministic condition, changing the order between the deterministic and non-deterministic parts may change the number of input rows. For example:

```sql
SELECT a FROM t WHERE rand() < 0.1 AND a = 1
```

and

```sql
SELECT a FROM t WHERE a = 1 AND rand() < 0.1
```

may call rand() a different number of times and therefore produce different output rows. This PR improves the handling by checking whether the predicate is placed before any non-deterministic predicates.

## How was this patch tested?
Expanded related test cases in FilterPushdownSuite.

Author: 蒋星博 <[email protected]>

Closes apache#14012 from jiangxb1987/ppd.
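A toy illustration of that check, with a hypothetical stand-in type rather than the actual Catalyst expressions or Optimizer rule: a predicate is only a reordering candidate while it and everything ordered before it is deterministic.

```scala
// Hypothetical stand-in for Catalyst expressions, just to show the check.
case class Pred(sql: String, deterministic: Boolean)

// Only the deterministic prefix may be freely reordered or pushed down; once a
// non-deterministic predicate (e.g. rand() < 0.1) appears, later predicates stay put.
def reorderablePrefix(predicates: Seq[Pred]): Seq[Pred] =
  predicates.takeWhile(_.deterministic)

val ex1 = Seq(Pred("rand() < 0.1", deterministic = false), Pred("a = 1", deterministic = true))
val ex2 = Seq(Pred("a = 1", deterministic = true), Pred("rand() < 0.1", deterministic = false))
assert(reorderablePrefix(ex1).isEmpty)                   // nothing may jump ahead of rand()
assert(reorderablePrefix(ex2).map(_.sql) == Seq("a = 1"))
```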
Member
@ooq Thanks, can you show what the generated code looks like? Also, can we add some unit tests for these data structures?
…s than minExecutors

## What changes were proposed in this pull request?
Currently, if `spark.dynamicAllocation.initialExecutors` is less than `spark.dynamicAllocation.minExecutors`, Spark will automatically pick minExecutors without any warning, while in 1.6 Spark would throw an exception if configured like this. So this proposes to add a warning log if these parameters are configured invalidly.

## How was this patch tested?
Unit test added to verify the scenario.

Author: jerryshao <[email protected]>

Closes apache#14149 from jerryshao/SPARK-16435.
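A small sketch of the proposed behaviour, using a hypothetical helper rather than the actual allocation-manager code: warn instead of silently clamping when initialExecutors is configured below minExecutors.

```scala
import org.slf4j.LoggerFactory

// Hypothetical validation helper illustrating the warning described above.
object DynamicAllocationCheck {
  private val log = LoggerFactory.getLogger(getClass)

  def resolveInitialExecutors(initial: Int, min: Int): Int = {
    if (initial < min) {
      log.warn(s"spark.dynamicAllocation.initialExecutors ($initial) is less than " +
        s"spark.dynamicAllocation.minExecutors ($min); using $min instead.")
    }
    math.max(initial, min)
  }
}
```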
…n audit for ML

## What changes were proposed in this pull request?
General decisions to follow, except where noted:

* spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
  * Annotate Estimator-Model pairs of classes and companion objects the same way.
  * For all algorithms marked Experimental with Since tag <= 1.6, remove the Experimental annotation.
  * For all algorithms marked Experimental with Since tag = 2.0, leave the Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.

Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:

* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier

DeveloperApi changes:

* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi.

## How was this patch tested?
N/A

Note to reviewers:

* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify.

Author: Joseph K. Bradley <[email protected]>

Closes apache#14147 from jkbradley/experimental-audit.
…ndowingSuite

## What changes were proposed in this pull request?
It's unnecessary. `QueryTest` already sets it.

Author: Burak Yavuz <[email protected]>

Closes apache#14170 from brkyvz/test-tz.
## What changes were proposed in this pull request?
Updated structured streaming programming guide with new windowed example.

## How was this patch tested?
Docs.

Author: James Thomas <[email protected]>

Closes apache#14183 from jjthomas/ss_docs_update.
## What changes were proposed in this pull request?
Minor example updates.

## How was this patch tested?
manual shivaram

Author: Felix Cheung <[email protected]>

Closes apache#14171 from felixcheung/rexample.
…example

## What changes were proposed in this pull request?
Minor documentation update for code example, code style, and missed reference to "sparkR.init".

## How was this patch tested?
manual shivaram

Author: Felix Cheung <[email protected]>

Closes apache#14178 from felixcheung/rcsvprogrammingguide.
…e Inferred Schema

#### What changes were proposed in this pull request?
If we create a table pointing to a parquet/json dataset without specifying the schema, the describe table command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table.

~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~

For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation.

#### How was this patch tested?
Added test cases.

Author: gatorsmile <[email protected]>

Closes apache#14148 from gatorsmile/describeSchema.
…s for ML, doc fixes

## What changes were proposed in this pull request?
Fixing issues found during 2.0 API checks:

* GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed.
* sqlDataTypes: name does not follow conventions. Do we need to expose it?
* Evaluator: inconsistent doc between evaluate and isLargerBetter.
* MinMaxScaler: math rendering --> hard to make it great, but I'll change it a little.
* GeneralizedLinearRegressionSummary: aic doc is incorrect --> will change to use more common name.

## How was this patch tested?
Existing unit tests. Docs generated locally. (MinMaxScaler is improved a tiny bit.)

Author: Joseph K. Bradley <[email protected]>

Closes apache#14187 from jkbradley/final-api-check-2.0.
## What changes were proposed in this pull request?
This PR exposes `sql` in PySpark Shell like Scala/R Shells for consistency.
**Background**
* Scala
```scala
scala> sql("select 1 a")
res0: org.apache.spark.sql.DataFrame = [a: int]
```
* R
```r
> sql("select 1")
SparkDataFrame[1:int]
```
**Before**
* Python
```python
>>> sql("select 1 a")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'sql' is not defined
```
**After**
* Python
```python
>>> sql("select 1 a")
DataFrame[a: int]
```
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <[email protected]>
Closes apache#14190 from dongjoon-hyun/SPARK-16536.
## What changes were proposed in this pull request?
This patch enables SparkSession to provide the Spark version.

## How was this patch tested?
Manual test:

```
scala> sc.version
res0: String = 2.1.0-SNAPSHOT

scala> spark.version
res1: String = 2.1.0-SNAPSHOT
```

```
>>> sc.version
u'2.1.0-SNAPSHOT'
>>> spark.version
u'2.1.0-SNAPSHOT'
```

Author: Liwei Lin <[email protected]>

Closes apache#14165 from lw-lin/add-version.
…adata

## What changes were proposed in this pull request?
`Alias` with metadata is not a no-op and we should not strip it in the `RemoveAliasOnlyProject` rule. This PR also made some improvements to this rule:

1. Extend the semantics of `alias-only`. Now we allow the project list to be partially aliased.
2. Add a unit test for this rule.

## How was this patch tested?
New `RemoveAliasOnlyProjectSuite`.

Author: Wenchen Fan <[email protected]>

Closes apache#14106 from cloud-fan/bug.
… all used place in MLLib

## What changes were proposed in this pull request?
Add a warning for the following cases when LBFGS training does not actually converge:

1. LogisticRegression
2. AFTSurvivalRegression
3. The LBFGS algorithm wrapper in the mllib package

## How was this patch tested?
N/A

Author: WeichenXu <[email protected]>

Closes apache#14157 from WeichenXu123/add_lbfgs_convergence_warning_for_all_used_place.
…style, minor fixes

## What changes were proposed in this pull request?
Cleanup of examples, mostly from PySpark-ML, to fix minor issues: unused imports, style consistency, a duplicated pipeline_example, use of the future print function, and a spelling error.

* The "Pipeline Example" is duplicated by "Simple Text Classification Pipeline" in Scala, Python, and Java.
* "Estimator Transformer Param Example" is duplicated by "Simple Params Example" in Scala, Python, and Java.
* Synced random_forest_classifier_example.py with Scala by adding an IndexToString label converter.
* Synced train_validation_split.py (in Scala, ModelSelectionViaTrainValidationExample) by adjusting the data split and adding a grid for the intercept.
* RegexTokenizer was doing nothing in tokenizer_example.py and JavaTokenizerExample.java; synced with the Scala version.

## How was this patch tested?
Local tests and running the modified examples.

Author: Bryan Cutler <[email protected]>

Closes apache#14081 from BryanCutler/examples-cleanup-SPARK-16403.
…le name

## What changes were proposed in this pull request?
Due to the changes of [SPARK-14963](https://issues.apache.org/jira/browse/SPARK-14963), the external shuffle recovery file name was changed mistakenly, so here change it back to the previous file name. This only affects the master branch; branch-2.0 is correct [here](https://github.com/apache/spark/blob/branch-2.0/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L195).

## How was this patch tested?
N/A

Author: jerryshao <[email protected]>

Closes apache#14197 from jerryshao/fix-typo-file-name.
…sources based on FileFormat

## What changes were proposed in this pull request?
This seems to be a regression, judging from https://issues.apache.org/jira/browse/SPARK-16698. A field name containing dots throws an exception. For example, the code below:

```scala
val path = "/tmp/path"
val json = """ {"a.b":"data"}"""
spark.sparkContext
  .parallelize(json :: Nil)
  .saveAsTextFile(path)
spark.read.json(path).collect()
```

throws an exception as below:

```
Unable to resolve a.b given [a.b];
org.apache.spark.sql.AnalysisException: Unable to resolve a.b given [a.b];
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
	at scala.Option.getOrElse(Option.scala:121)
```

This problem was introduced in apache@17eec0a#diff-27c76f96a7b2733ecfd6f46a1716e153R121. When extracting the data columns, it does not account for field names that contain dots. Actually, it seems field names are not expected to be quoted when defining a schema, so there is no need to consider whether a name is wrapped in quotes, because the actual schema (inferred or user-given) would not carry quotes in its field names. For example, this throws an exception (**loading JSON from an RDD is fine**):

```scala
val json = """ {"a.b":"data"}"""
val rdd = spark.sparkContext.parallelize(json :: Nil)
spark.read.schema(StructType(Seq(StructField("`a.b`", StringType, true))))
  .json(rdd).select("`a.b`").printSchema()
```

as below:

```
cannot resolve '```a.b```' given input columns: [`a.b`];
org.apache.spark.sql.AnalysisException: cannot resolve '```a.b```' given input columns: [`a.b`];
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

## How was this patch tested?
Unit tests in `FileSourceStrategySuite`.

Author: hyukjinkwon <[email protected]>

Closes apache#14339 from HyukjinKwon/SPARK-16698-regression.
…ow functions

## What changes were proposed in this pull request?
This PR fixes a minor formatting issue of `WindowSpecDefinition.sql` when no partitioning expressions are present.

Before:

```sql
( ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

After:

```sql
(ORDER BY `a` ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
```

## How was this patch tested?
New test case added in `ExpressionSQLBuilderSuite`.

Author: Cheng Lian <[email protected]>

Closes apache#14334 from liancheng/window-spec-sql-format.
## What changes were proposed in this pull request?
Fix a broken URL. Also, the sparkR.session.stop doc page should have it in the header instead of saying "sparkR.stop". The data type section is in the middle of a list of gapply/gapplyCollect subsections.

## How was this patch tested?
Manual test.

Author: Felix Cheung <[email protected]>

Closes apache#14329 from felixcheung/rdoclinkfix.
…efault to 1e-6

## What changes were proposed in this pull request?
Replace the ANN convergence tolerance param default of 1e-4 with 1e-6 so that it is the same as other algorithms in MLlib that use LBFGS as the optimizer.

## How was this patch tested?
Existing tests.

Author: WeichenXu <[email protected]>

Closes apache#14286 from WeichenXu123/update_ann_tol.
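Usage-wise, the new default can still be overridden explicitly; a minimal sketch, assuming the standard shared `tol` param setter on the estimator:

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// After this change the default convergence tolerance is 1e-6 (matching other
// LBFGS-based algorithms); setting it explicitly still works as before.
val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3)) // hypothetical layer sizes, for illustration only
  .setTol(1e-6)
```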
## What changes were proposed in this pull request?
This patch removes dev/audit-release. It was initially created to do basic release auditing, but it has been unused for the last year or more.

## How was this patch tested?
N/A

Author: Reynold Xin <[email protected]>

Closes apache#14342 from rxin/SPARK-16685.
## What changes were proposed in this pull request?
Mesos agents by default will not pull docker images that are already cached locally. In order to run Spark executors from mutable tags like `:latest`, this commit introduces a Spark setting, `spark.mesos.executor.docker.forcePullImage`. Setting this flag to true will tell the Mesos agent to force pull the docker image (the default is `false`, which is consistent with the previous implementation and Mesos' default behaviour).

## How was this patch tested?
I ran a sample application including this change on a Mesos cluster and verified the correct behaviour both with and without force pulling the executor image. As expected, the image is force pulled when the flag is set.

Author: Philipp Hoffmann <[email protected]>

Closes apache#13051 from philipphoffmann/force-pull-image.
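A usage sketch for the new flag (the `forcePullImage` setting name comes from the commit message above; the image name and the `spark.mesos.executor.docker.image` companion setting are shown as assumptions for illustration):

```scala
import org.apache.spark.SparkConf

// Force the Mesos agent to re-pull the executor image even if a copy is cached
// locally, which matters when running from a mutable tag like :latest.
val conf = new SparkConf()
  .setAppName("mesos-docker-example")
  .set("spark.mesos.executor.docker.image", "myrepo/spark-executor:latest") // example image
  .set("spark.mesos.executor.docker.forcePullImage", "true")
```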
… doc

## What changes were proposed in this pull request?
Fixed several inline formatting issues in the ML features doc.

Before:
<img width="475" alt="screen shot 2016-07-14 at 12 24 57 pm" src="https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png">

After:
<img width="404" alt="screen shot 2016-07-14 at 12 25 48 pm" src="https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png">

## How was this patch tested?
Generate the docs locally with `SKIP_API=1 jekyll build` and view them in the browser.

Author: Shuai Lin <[email protected]>

Closes apache#14194 from lins05/fix-docs-formatting.
This reverts commit 978cd5f.
…xception in RowBasedKeyValueBatch#spill()
…wbasedfastaggmap-pr1
Contributor
Author
Closing this now. Open another PR for cleaner commit history: #14349
Test build #62836 has finished for PR 14174 at commit
## What changes were proposed in this pull request?

This PR is the first step for the following feature:

For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap was backed by a `ColumnarBatch`. This has performance issues when we have a wide schema for the aggregation table (a large number of key fields or value fields).

In this JIRA, we support another implementation of the fast hashmap, which is backed by a `RowBasedKeyValueBatch`. We then automatically pick between the two implementations based on certain knobs.

In this first-step PR, implementations for `RowBasedKeyValueBatch` and `RowBasedHashMapGenerator` are added.

## How was this patch tested?

Unit tests: `RowBasedKeyValueBatchSuite`
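To make the intent concrete, here is a deliberately simplified, hypothetical sketch of how a row-based key/value batch can serve as a fast aggregation "cache" (it is not the actual `RowBasedKeyValueBatch` API, nor the code emitted by `RowBasedHashMapGenerator`):

```scala
import scala.collection.mutable

// Simplified illustration: key/value rows are appended once and updated in place,
// and a small hash index maps a grouping key to its row, acting as a fast-path
// cache in front of the general-purpose aggregation hash map.
class SimpleRowBatch(capacity: Int, numValueFields: Int) {
  private val keys = mutable.ArrayBuffer.empty[Seq[Any]]
  private val values = mutable.ArrayBuffer.empty[Array[Long]]
  private val index = mutable.HashMap.empty[Seq[Any], Int]

  /** Returns the aggregation buffer for `key`, inserting a zeroed row if there is room. */
  def findOrInsert(key: Seq[Any]): Option[Array[Long]] = index.get(key) match {
    case Some(i) => Some(values(i))
    case None if keys.length < capacity =>
      keys += key
      values += new Array[Long](numValueFields)
      index(key) = keys.length - 1
      Some(values.last)
    case None => None // batch is full: caller falls back to the regular hash map
  }
}

// Example: summing a value per key; the fallback path when the batch fills up is not shown.
val batch = new SimpleRowBatch(capacity = 1024, numValueFields = 1)
Seq(("a", 1L), ("b", 2L), ("a", 3L)).foreach { case (k, v) =>
  batch.findOrInsert(Seq(k)).foreach(buf => buf(0) += v)
}
```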