
Conversation

@kanzhang
Contributor

No description provided.

@AmplabJenkins

Can one of the admins verify this patch?

This patch checks top-level closure arguments to `ClosureCleaner.clean` for `return` statements and raises an exception if it finds any.  This is mainly a user-friendliness addition, since programs with return statements in closure arguments will currently fail upon RDD actions with a less-than-intuitive error message.

Author: William Benton <[email protected]>

Closes apache#717 from willb/spark-571 and squashes the following commits:

c41eb7d [William Benton] Another test case for SPARK-571
30c42f4 [William Benton] Stylistic cleanups
559b16b [William Benton] Stylistic cleanups from review
de13b79 [William Benton] Style fixes
295b6a5 [William Benton] Forbid return statements in closure arguments.
b017c47 [William Benton] Added a test for SPARK-571
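
As an illustration, here is a minimal sketch (user code hypothetical) of the kind of closure this patch now rejects up front; the `return` inside the `foreach` body is a non-local return, which previously only surfaced as a confusing failure at action time:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical user code: the `return` escapes the closure, so
// ClosureCleaner.clean now raises a clear exception instead of
// letting the job fail later with an opaque error.
def findFirstEven(rdd: RDD[Int]): Unit = {
  rdd.foreach { x =>
    if (x % 2 == 0) return
  }
}
```
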
@kanzhang kanzhang changed the title [SPARK-1817] NumericRange was partitioned differently, causing RDD.zip to fail [SPARK-1817] NumericRange was partitioned differently ... May 13, 2014
@kanzhang kanzhang changed the title [SPARK-1817] NumericRange was partitioned differently ... [SPARK-1817] NumericRange was partitioned differently from ... May 13, 2014
ajtulloch and others added 3 commits May 13, 2014 17:31
Summary:
https://issues.apache.org/jira/browse/SPARK-1791

Simple fix, and backward compatible, since

- anyone who set the threshold was getting completely wrong answers.
- anyone who did not set the threshold had the default 0.0 value for the threshold anyway.

Test Plan:
Unit test added that is verified to fail under the old implementation,
and pass under the new implementation.

Reviewers:

CC:

Author: Andrew Tulloch <[email protected]>

Closes apache#725 from ajtulloch/SPARK-1791-SVM and squashes the following commits:

770f55d [Andrew Tulloch] SPARK-1791 - SVM implementation does not use threshold parameter
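
A minimal sketch (names hypothetical) of the post-fix prediction behavior: the margin is compared against the configured threshold instead of the threshold being ignored:

```scala
// Hypothetical standalone version of the thresholded prediction.
def predictPoint(features: Array[Double], weights: Array[Double],
    intercept: Double, threshold: Double): Double = {
  val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
  if (margin > threshold) 1.0 else 0.0
}
```
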
The solution is to wrap a try / catch / log around the posting of each event to each listener.

Author: Andrew Or <[email protected]>

Closes apache#759 from andrewor14/listener-die and squashes the following commits:

aee5107 [Andrew Or] Merge branch 'master' of github.com:apache/spark into listener-die
370939f [Andrew Or] Remove two layers of indirection
422d278 [Andrew Or] Explicitly throw an exception instead of 1 / 0
0df0e2a [Andrew Or] Try/catch and log exceptions when posting events
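
A minimal sketch (listener interface hypothetical) of the guarded dispatch described above, so that one misbehaving listener no longer kills the posting thread:

```scala
// Placeholder for the real listener interface.
trait Listener { def onEvent(event: Any): Unit }

def postToAll(listeners: Seq[Listener], event: Any): Unit = {
  listeners.foreach { listener =>
    try {
      listener.onEvent(event)
    } catch {
      // Log and continue instead of letting the exception propagate.
      case e: Exception =>
        println(s"Listener ${listener.getClass.getName} threw an exception: $e")
    }
  }
}
```
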
JIRA issue: [SPARK-1527](https://issues.apache.org/jira/browse/SPARK-1527)

getName() only gets the last component of the file path. When deleting test-generated directories,
we should pass the generated directory's absolute path to DiskBlockManager.

Author: Ye Xianjin <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <[email protected]>

Closes apache#436 from advancedxy/SPARK-1527 and squashes the following commits:

4678bab [Ye Xianjin] change rootDir*.getname to rootDir*.getAbsolutePath so the temporary directories are deleted when the test is finished.
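
For illustration, the distinction the fix relies on (paths hypothetical):

```scala
import java.io.File

val rootDir = new File("/tmp/spark-tests/blockmgr-1")
rootDir.getName          // "blockmgr-1" -- the last path component only
rootDir.getAbsolutePath  // "/tmp/spark-tests/blockmgr-1" -- what deletion needs
```
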
@mateiz
Contributor

mateiz commented May 14, 2014

Hey @kanzhang, while this is a partial fix for one case, it's not going to solve SPARK-1817 in general. The right thing to do will be to warn or throw a better error when partitions are discovered to have different sizes. I'd suggest opening a separate JIRA for updating NumericRange (the change here) and keeping 1817 open.

Contributor

We should probably add a comment above this to say that the goal is to make it slice at the exact same positions as other sequences.

@kanzhang
Contributor Author

Ok, will do.

marmbrus and others added 16 commits May 13, 2014 21:23
…ting Scala SQLContext.

Author: Michael Armbrust <[email protected]>

Closes apache#761 from marmbrus/existingContext and squashes the following commits:

4651051 [Michael Armbrust] Make it possible to create Java/Python SQLContexts from an existing Scala SQLContext.
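
A hedged usage sketch of the new capability (assuming an existing SparkContext `sc`):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.api.java.JavaSQLContext

val sqlContext = new SQLContext(sc)
// Wrap the existing Scala context for use from the Java API.
val javaSqlContext = new JavaSQLContext(sqlContext)
```
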
…partition

This change adds a new partitioner which allows users
to specify # of keys per partition.

Author: Syed Hashmi <[email protected]>

Closes apache#721 from syedhashmi/master and squashes the following commits:

4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
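
A minimal sketch (class name and key mapping hypothetical) of a partitioner in the spirit of this change, targeting a fixed number of keys per partition:

```scala
import org.apache.spark.Partitioner

class KeysPerPartitionPartitioner(totalKeys: Int, keysPerPartition: Int)
    extends Partitioner {
  override val numPartitions: Int =
    math.max(1, (totalKeys + keysPerPartition - 1) / keysPerPartition)
  override def getPartition(key: Any): Int = key match {
    // Consecutive integer keys land in the same partition in blocks.
    case i: Int => math.min(i / keysPerPartition, numPartitions - 1)
    case other  => (other.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}
```
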
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions.

A simple serializer and test cases are added as well.

Author: larvaboy <[email protected]>

Closes apache#737 from larvaboy/master and squashes the following commits:

bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.
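
A hedged sketch of the two-phase counting described above, done directly against an RDD with stream-lib's HyperLogLog (the rsd value is illustrative):

```scala
import com.clearspring.analytics.stream.cardinality.HyperLogLog
import org.apache.spark.rdd.RDD

def approxCountDistinct(rdd: RDD[Any], rsd: Double = 0.05): Long = {
  // Phase 1: build one HyperLogLog sketch per partition.
  val perPartition = rdd.mapPartitions { iter =>
    val hll = new HyperLogLog(rsd)
    iter.foreach(hll.offer)
    Iterator(hll)
  }
  // Phase 2: merge the partition sketches and read off the estimate.
  perPartition.reduce { (a, b) => a.addAll(b); a }.cardinality()
}
```
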
Fixed a bug that was preventing us from ever pruning beneath Joins.

## TPC-DS Q3
### Before:
```
Aggregate false, [d_year#12,i_brand#65,i_brand_id#64], [d_year#12,i_brand_id#64 AS brand_id#0,i_brand#65 AS brand#1,SUM(PartialSum#79) AS sum_agg#2]
 Exchange (HashPartitioning [d_year#12:0,i_brand#65:1,i_brand_id#64:2], 150)
  Aggregate true, [d_year#12,i_brand#65,i_brand_id#64], [d_year#12,i_brand#65,i_brand_id#64,SUM(CAST(ss_ext_sales_price#49, DoubleType)) AS PartialSum#79]
   Project [d_year#12:6,i_brand#65:59,i_brand_id#64:58,ss_ext_sales_price#49:43]
    HashJoin [ss_item_sk#36], [i_item_sk#57], BuildRight
     Exchange (HashPartitioning [ss_item_sk#36:30], 150)
      HashJoin [d_date_sk#6], [ss_sold_date_sk#34], BuildRight
       Exchange (HashPartitioning [d_date_sk#6:0], 150)
        Filter (d_moy#14:8 = 12)
         HiveTableScan [d_date_sk#6,d_date_id#7,d_date#8,d_month_seq#9,d_week_seq#10,d_quarter_seq#11,d_year#12,d_dow#13,d_moy#14,d_dom#15,d_qoy#16,d_fy_year#17,d_fy_quarter_seq#18,d_fy_week_seq#19,d_day_name#20,d_quarter_name#21,d_holiday#22,d_weekend#23,d_following_holiday#24,d_first_dom#25,d_last_dom#26,d_same_day_ly#27,d_same_day_lq#28,d_current_day#29,d_current_week#30,d_current_month#31,d_current_quarter#32,d_current_year#33], (MetastoreRelation default, date_dim, Some(dt)), None
       Exchange (HashPartitioning [ss_sold_date_sk#34:0], 150)
        HiveTableScan [ss_sold_date_sk#34,ss_sold_time_sk#35,ss_item_sk#36,ss_customer_sk#37,ss_cdemo_sk#38,ss_hdemo_sk#39,ss_addr_sk#40,ss_store_sk#41,ss_promo_sk#42,ss_ticket_number#43,ss_quantity#44,ss_wholesale_cost#45,ss_list_price#46,ss_sales_price#47,ss_ext_discount_amt#48,ss_ext_sales_price#49,ss_ext_wholesale_cost#50,ss_ext_list_price#51,ss_ext_tax#52,ss_coupon_amt#53,ss_net_paid#54,ss_net_paid_inc_tax#55,ss_net_profit#56], (MetastoreRelation default, store_sales, None), None
     Exchange (HashPartitioning [i_item_sk#57:0], 150)
      Filter (i_manufact_id#70:13 = 436)
       HiveTableScan [i_item_sk#57,i_item_id#58,i_rec_start_date#59,i_rec_end_date#60,i_item_desc#61,i_current_price#62,i_wholesale_cost#63,i_brand_id#64,i_brand#65,i_class_id#66,i_class#67,i_category_id#68,i_category#69,i_manufact_id#70,i_manufact#71,i_size#72,i_formulation#73,i_color#74,i_units#75,i_container#76,i_manager_id#77,i_product_name#78], (MetastoreRelation default, item, None), None
```
### After
```
Aggregate false, [d_year#172,i_brand#225,i_brand_id#224], [d_year#172,i_brand_id#224 AS brand_id#160,i_brand#225 AS brand#161,SUM(PartialSum#239) AS sum_agg#162]
 Exchange (HashPartitioning [d_year#172:0,i_brand#225:1,i_brand_id#224:2], 150)
  Aggregate true, [d_year#172,i_brand#225,i_brand_id#224], [d_year#172,i_brand#225,i_brand_id#224,SUM(CAST(ss_ext_sales_price#209, DoubleType)) AS PartialSum#239]
   Project [d_year#172:1,i_brand#225:5,i_brand_id#224:3,ss_ext_sales_price#209:0]
    HashJoin [ss_item_sk#196], [i_item_sk#217], BuildRight
     Exchange (HashPartitioning [ss_item_sk#196:2], 150)
      Project [ss_ext_sales_price#209:2,d_year#172:1,ss_item_sk#196:3]
       HashJoin [d_date_sk#166], [ss_sold_date_sk#194], BuildRight
        Exchange (HashPartitioning [d_date_sk#166:0], 150)
         Project [d_date_sk#166:0,d_year#172:1]
          Filter (d_moy#174:2 = 12)
           HiveTableScan [d_date_sk#166,d_year#172,d_moy#174], (MetastoreRelation default, date_dim, Some(dt)), None
        Exchange (HashPartitioning [ss_sold_date_sk#194:2], 150)
         HiveTableScan [ss_ext_sales_price#209,ss_item_sk#196,ss_sold_date_sk#194], (MetastoreRelation default, store_sales, None), None
     Exchange (HashPartitioning [i_item_sk#217:1], 150)
      Project [i_brand_id#224:0,i_item_sk#217:1,i_brand#225:2]
       Filter (i_manufact_id#230:3 = 436)
        HiveTableScan [i_brand_id#224,i_item_sk#217,i_brand#225,i_manufact_id#230], (MetastoreRelation default, item, None), None
```

Author: Michael Armbrust <[email protected]>

Closes apache#729 from marmbrus/fixPruning and squashes the following commits:

5feeff0 [Michael Armbrust] Improve column pruning.
…eve...

...loper api

Author: Koert Kuipers <[email protected]>

Closes apache#764 from koertkuipers/feat-rdd-developerapi and squashes the following commits:

8516dd2 [Koert Kuipers] SPARK-1801. expose InterruptibleIterator and TaskKilledException in developer api
Author: Marcelo Vanzin <[email protected]>

Closes apache#763 from vanzin/netty-dep-hell and squashes the following commits:

dfb6ce2 [Marcelo Vanzin] Fix dep exclusion: avro-ipc, not avro, depends on netty.
This PR replaces the Schedulable data structures in Pool.scala with thread-safe ones from java. Note that Scala's `with SynchronizedBuffer` trait is soon to be deprecated in 2.11 because it is ["inherently unreliable"](http://www.scala-lang.org/api/2.11.0/index.html#scala.collection.mutable.SynchronizedBuffer). We should slowly drift away from `SynchronizedBuffer` in other places too.

Note that this PR introduces an API-breaking change; `sc.getAllPools` now returns an Array rather than an ArrayBuffer. This is because we want this method to return an immutable copy, rather than a mutable one that may confuse users who try to modify the copy, since doing so has no effect on the original data structure.
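
A minimal sketch (placeholder trait) of the pattern: back mutable scheduler state with java.util.concurrent collections and hand out snapshot copies:

```scala
import java.util.concurrent.CopyOnWriteArrayList

trait Schedulable  // placeholder for the real trait

val schedulableQueue = new CopyOnWriteArrayList[Schedulable]()

// Returning an Array gives callers a snapshot; mutating it cannot
// affect the underlying queue.
def getAllPools: Array[Schedulable] =
  schedulableQueue.toArray(new Array[Schedulable](0))
```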

Author: Andrew Or <[email protected]>

Closes apache#762 from andrewor14/pool-npe and squashes the following commits:

383e739 [Andrew Or] JavaConverters -> JavaConversions
3f32981 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pool-npe
769be19 [Andrew Or] Assorted minor changes
2189247 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pool-npe
05ad9e9 [Andrew Or] Fix test - contains is not the same as containsKey
0921ea0 [Andrew Or] var -> val
07d720c [Andrew Or] Synchronize Schedulable data structures
Pretty self-explanatory

Author: Tathagata Das <[email protected]>

Closes apache#722 from tdas/example-fix and squashes the following commits:

7839979 [Tathagata Das] Minor changes.
0673441 [Tathagata Das] Fixed java docs of java streaming example
e687123 [Tathagata Das] Fixed scala style errors.
9b8d112 [Tathagata Das] Fixed streaming examples docs to use run-example instead of spark-submit.
…tive dependency info

LICENSE and NOTICE policy is explained here:

http://www.apache.org/dev/licensing-howto.html
http://www.apache.org/legal/3party.html

This leads to the following changes.

First, this change enables two extensions to maven-shade-plugin in assembly/ that will try to include and merge all NOTICE and LICENSE files. This can't hurt.

This generates a consolidated NOTICE file that I manually added to NOTICE.

Next, a list of all dependencies and their licenses was generated:
`mvn ... license:aggregate-add-third-party`
to create: `target/generated-sources/license/THIRD-PARTY.txt`

Each dependency is listed with one or more licenses; where there is more than one, I determined the most-compatible license.

For "unknown" license dependencies, I manually evaluateD their license. Many are actually Apache projects or components of projects covered already. The only non-trivial one was Colt, which has its own (compatible) license.

I ignored Apache-licensed and public domain dependencies as these require no further action (beyond NOTICE above).

BSD and MIT licenses (permissive Category A licenses) are evidently supposed to be mentioned in LICENSE, so I added a section with output from the THIRD-PARTY.txt file, formatted appropriately.

Everything else (Category B licenses) is evidently supposed to be mentioned in NOTICE, so I did the same there.

LICENSE contained some license statements for source code that is redistributed. I left this as I think that is the right place to put it.

Author: Sean Owen <[email protected]>

Closes apache#770 from srowen/SPARK-1827 and squashes the following commits:

a764504 [Sean Owen] Add LICENSE and NOTICE info for all transitive dependencies as of 1.0
Place more emphasis on using precompiled binary versions of Spark and Mesos
instead of encouraging the reader to compile from source.

Author: Andrew Ash <[email protected]>

Closes apache#756 from ash211/spark-1818 and squashes the following commits:

7ef3b33 [Andrew Ash] Brief explanation of the interactions between Spark and Mesos
e7dea8e [Andrew Ash] Add troubleshooting and debugging section
956362d [Andrew Ash] Don't need to pass spark.executor.uri into the spark shell
de3353b [Andrew Ash] Wrap to 100char
7ebf6ef [Andrew Ash] Polish on the section on Mesos Master URLs
3dcc2c1 [Andrew Ash] Use --tgz parameter of make-distribution
41b68ed [Andrew Ash] Period at end of sentence; formatting on :5050
8bf2c53 [Andrew Ash] Update site.MESOS_VERSIOn to match /pom.xml
74f2040 [Andrew Ash] SPARK-1818 Freshen Mesos documentation
…ther dependencies

See https://issues.apache.org/jira/browse/SPARK-1828 for more information.

This is being submitted to Jenkins for testing. The dependency won't fully
propagate in Maven Central for a few more hours.

Author: Patrick Wendell <[email protected]>

Closes apache#767 from pwendell/hive-shaded and squashes the following commits:

ea10ac5 [Patrick Wendell] SPARK-1828: Created forked version of hive-exec that doesn't bundle other dependencies
…uler

If the intended behavior was that uncaught exceptions thrown in functions being run by the Akka scheduler would end up being handled by the default uncaught exception handler set in Executor, and if that behavior is, in fact, correct, then this is a way to accomplish that.  I'm not certain, though, that we shouldn't be doing something different to handle uncaught exceptions from some of these scheduled functions.

In any event, this PR covers all of the cases I comment on in [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620).

Author: Mark Hamstra <[email protected]>

Closes apache#622 from markhamstra/SPARK-1620 and squashes the following commits:

071d193 [Mark Hamstra] refactored post-SPARK-1772
1a6a35e [Mark Hamstra] another style fix
d30eb94 [Mark Hamstra] scalastyle
3573ecd [Mark Hamstra] Use wrapped try/catch in Utils.tryOrExit
8fc0439 [Mark Hamstra] Make functions run by the Akka scheduler use Executor's UncaughtExceptionHandler
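
A hedged sketch of the `Utils.tryOrExit` idea named in the commits: run the block and route any uncaught Throwable to the JVM's default uncaught-exception handler:

```scala
def tryOrExit(block: => Unit): Unit = {
  try {
    block
  } catch {
    case t: Throwable =>
      // Fall back to rethrowing if no default handler is installed.
      Option(Thread.getDefaultUncaughtExceptionHandler) match {
        case Some(handler) => handler.uncaughtException(Thread.currentThread(), t)
        case None          => throw t
      }
  }
}
```
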
Author: witgo <[email protected]>

Closes apache#773 from witgo/sbt_javaOptions and squashes the following commits:

26c7d38 [witgo] Improve sbt configuration
As "99 ms" up to 99 ms
As "0.1 s" from 0.1 s up to 0.9 s

https://issues.apache.org/jira/browse/SPARK-1829

Compare the first image to the second here: http://imgur.com/RaLEsSZ,7VTlgfo#0
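
A minimal sketch of the formatting rule (boundaries assumed from the description above):

```scala
def formatDuration(ms: Long): String =
  if (ms < 100) s"$ms ms"                      // e.g. "99 ms"
  else if (ms < 1000) f"${ms / 1000.0}%.1f s"  // e.g. "0.1 s"
  else s"${ms / 1000} s"
```
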

Author: Andrew Ash <[email protected]>

Closes apache#768 from ash211/spark-1829 and squashes the following commits:

1c15b8e [Andrew Ash] SPARK-1829 Format sub-second durations more appropriately
This is nicer than relying on new SparkContext(new SparkConf())
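
Usage sketch: with the no-arg constructor, configuration comes from system properties, equivalent to passing a fresh SparkConf:

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext()  // same as new SparkContext(new SparkConf())
```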

Author: Patrick Wendell <[email protected]>

Closes apache#774 from pwendell/spark-context and squashes the following commits:

ef9f12f [Patrick Wendell] SPARK-1833 - Have an empty SparkContext constructor.
@kanzhang kanzhang changed the title [SPARK-1817] NumericRange was partitioned differently from ... [SPARK-1817] RDD.zip() should verify partition sizes for each partition May 14, 2014
@kanzhang
Contributor Author

@mateiz I factored out NumericRange changes to #776.

Looked at the Scaladoc just now; zip() assumes the same number of elements in each partition.

mengxr and others added 2 commits May 14, 2014 14:57
The default constructor loads default properties, which can fail the test.

Author: Xiangrui Meng <[email protected]>

Closes apache#775 from mengxr/pyspark-conf-fix and squashes the following commits:

83ef6c4 [Xiangrui Meng] do not load defaults when testing SparkConf in pyspark
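
For reference, the Scala API expresses the same idea; a hedged sketch of a test-friendly configuration that skips ambient system properties:

```scala
import org.apache.spark.SparkConf

// loadDefaults = false keeps the conf independent of system properties.
val conf = new SparkConf(loadDefaults = false).setAppName("test")
```
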
After being invited by @witgo to make the change in apache@6bee01d#commitcomment-6284165.

Author: Jacek Laskowski <[email protected]>

Closes apache#748 from jaceklaskowski/sparkenv-string-interpolation and squashes the following commits:

be6ebac [Jacek Laskowski] String interpolation + some other small changes
@mateiz
Contributor

mateiz commented May 29, 2014

Alright, you should just close this PR. I think we need to keep the JIRA issue open, but just make the fix be a better exception.

@kanzhang
Contributor Author

Ok.

@kanzhang kanzhang closed this May 29, 2014
jyotiska and others added 23 commits May 28, 2014 23:08
Added doctest for method textFile and description for methods _initialize_context and _ensure_initialized in context.py

Author: Jyotiska NK <[email protected]>

Closes apache#187 from jyotiska/pyspark_context and squashes the following commits:

356f945 [Jyotiska NK] Added doctest for textFile method in context.py
5b23686 [Jyotiska NK] Updated context.py with method descriptions
Author: Yin Huai <[email protected]>

Closes apache#889 from yhuai/SPARK-1935 and squashes the following commits:

7d50ef1 [Yin Huai] Explicitly add commons-codec 1.5 as a dependency.
JIRA issue: [SPARK-1368](https://issues.apache.org/jira/browse/SPARK-1368)

This PR introduces two major updates:

- Replaced FP style code with `while` loop and reusable `GenericMutableRow` object in critical path of `HiveTableScan`.
- Using `ColumnProjectionUtils` to help optimizing RCFile and ORC column pruning.

My quick micro benchmark suggests these two optimizations made the optimized version 2x and 2.5x faster when scanning CSV table and RCFile table respectively:

```
Original:

[info] CSV: 27676 ms, RCFile: 26415 ms
[info] CSV: 27703 ms, RCFile: 26029 ms
[info] CSV: 27511 ms, RCFile: 25962 ms

Optimized:

[info] CSV: 13820 ms, RCFile: 10402 ms
[info] CSV: 14158 ms, RCFile: 10691 ms
[info] CSV: 13606 ms, RCFile: 10346 ms
```

The micro benchmark loads a 609 MB CSV file (structurally similar to the `src` test table) into a normal Hive table with `LazySimpleSerDe` and an RCFile table, then scans these tables respectively.

Preparation code:

```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanPrepare extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  hql("drop table scan_csv")
  hql("drop table scan_rcfile")

  hql("""create table scan_csv (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        |  with serdeproperties ('field.delim'=',')
      """.stripMargin)

  hql(s"""load data local inpath "${args(0)}" into table scan_csv""")

  hql("""create table scan_rcfile (key int, value string)
        |  row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
        |stored as
        |  inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat'
        |  outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
      """.stripMargin)

  hql(
    """
      |from scan_csv
      |insert overwrite table scan_rcfile
      |select scan_csv.key, scan_csv.value
    """.stripMargin)
}
```

Benchmark code:

```scala
package org.apache.spark.examples.sql.hive

import org.apache.spark.sql.hive.LocalHiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveTableScanBenchmark extends App {
  val sparkContext = new SparkContext(
    new SparkConf()
      .setMaster("local")
      .setAppName(getClass.getSimpleName.stripSuffix("$")))

  val hiveContext = new LocalHiveContext(sparkContext)

  import hiveContext._

  val scanCsv = hql("select key from scan_csv")
  val scanRcfile = hql("select key from scan_rcfile")

  val csvDuration = benchmark(scanCsv.count())
  val rcfileDuration = benchmark(scanRcfile.count())

  println(s"CSV: $csvDuration ms, RCFile: $rcfileDuration ms")

  def benchmark(f: => Unit) = {
    val begin = System.currentTimeMillis()
    f
    val end = System.currentTimeMillis()
    end - begin
  }
}
```

@marmbrus Please help review, thanks!

Author: Cheng Lian <[email protected]>

Closes apache#758 from liancheng/fastHiveTableScan and squashes the following commits:

4241a19 [Cheng Lian] Distinguishes sorted and possibly not sorted operations more accurately in HiveComparisonTest
cf640d8 [Cheng Lian] More HiveTableScan optimisations:
bf0e7dc [Cheng Lian] Added SortedOperation pattern to match *some* definitely sorted operations and avoid some sorting cost in HiveComparisonTest.
6d1c642 [Cheng Lian] Using ColumnProjectionUtils to optimise RCFile and ORC column pruning
eb62fd3 [Cheng Lian] [SPARK-1368] Optimized HiveTableScan
A straightforward implementation of LPA algorithm for detecting graph communities using the Pregel framework.  Amongst the growing literature on community detection algorithms in networks, LPA is perhaps the most elementary, and despite its flaws it remains a nice and simple approach.

Author: Ankur Dave <[email protected]>
Author: haroldsultan <[email protected]>
Author: Harold Sultan <[email protected]>

Closes apache#905 from haroldsultan/master and squashes the following commits:

327aee0 [haroldsultan] Merge pull request apache#2 from ankurdave/label-propagation
227a4d0 [Ankur Dave] Untabify
0ac574c [haroldsultan] Merge pull request apache#1 from ankurdave/label-propagation
0e24303 [Ankur Dave] Add LabelPropagationSuite
84aa061 [Ankur Dave] LabelPropagation: Fix compile errors and style; rename from LPA
9830342 [Harold Sultan] initial version of LPA
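
A hedged sketch of LPA on GraphX's Pregel, close in spirit to the commits above: each vertex starts with its own id as its label and repeatedly adopts the most frequent label among its neighbors:

```scala
import org.apache.spark.graphx._

def labelPropagation[VD, ED](graph: Graph[VD, ED], maxSteps: Int): Graph[VertexId, ED] = {
  // Initialize every vertex's label to its own id.
  val lpaGraph = graph.mapVertices { case (vid, _) => vid }
  def sendMessage(e: EdgeTriplet[VertexId, ED]) =
    Iterator((e.srcId, Map(e.dstAttr -> 1L)), (e.dstId, Map(e.srcAttr -> 1L)))
  def mergeMessage(a: Map[VertexId, Long], b: Map[VertexId, Long]) =
    (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
  def vertexProgram(vid: VertexId, attr: VertexId, msg: Map[VertexId, Long]) =
    if (msg.isEmpty) attr else msg.maxBy(_._2)._1  // adopt the majority label
  Pregel(lpaGraph, Map.empty[VertexId, Long], maxIterations = maxSteps)(
    vertexProgram, sendMessage, mergeMessage)
}
```
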
We add all the classes annotated as `DeveloperApi` to `~/.mima-excludes`.

Author: Prashant Sharma <[email protected]>
Author: nikhil7sh <[email protected]>

Closes apache#904 from ScrapCodes/SPARK-1820/ignore-Developer-Api and squashes the following commits:

de944f9 [Prashant Sharma] Code review.
e3c5215 [Prashant Sharma] Incorporated patrick's suggestions and fixed the scalastyle build.
9983a42 [nikhil7sh] [SPARK-1820] Make GenerateMimaIgnore @DeveloperAPI annotation aware
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:

* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions

You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.

Author: Matei Zaharia <[email protected]>

Closes apache#896 from mateiz/1.0-docs and squashes the following commits:

03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
Author: Prashant Sharma <[email protected]>

Closes apache#910 from ScrapCodes/enable-mima/spark-core and squashes the following commits:

79f3687 [Prashant Sharma] updated Mima to check against version 1.0
1e8969c [Prashant Sharma] Spark core missed out on Mima settings. So in effect we never tested spark core for mima related errors.
…ing executor's info

https://issues.apache.org/jira/browse/SPARK-1901

Author: Zhen Peng <[email protected]>

Closes apache#854 from zhpengg/bugfix-worker-kills-executor and squashes the following commits:

21d380b [Zhen Peng] add some error messages
506cea6 [Zhen Peng] add some docs for killProcess()
a0b9860 [Zhen Peng] [SPARK-1901] worker should make sure executor has exited before updating executor's info
Author: Andrew Ash <[email protected]>

Closes apache#927 from ash211/patch-5 and squashes the following commits:

79b577d [Andrew Ash] Typo: and -> an
Author: nchammas <[email protected]>

Closes apache#923 from nchammas/patch-1 and squashes the following commits:

65c4d18 [nchammas] updated link to mailing list
Spark Streaming requires at least two working threads, but the document gives an example like:

```scala
import org.apache.spark.api.java.function._
import org.apache.spark.streaming._
import org.apache.spark.streaming.api._
// Create a StreamingContext with a local master
val ssc = new StreamingContext("local", "NetworkWordCount", Seconds(1))
```

http://spark.apache.org/docs/latest/streaming-programming-guide.html

Author: CodingCat <[email protected]>

Closes apache#924 from CodingCat/master and squashes the following commits:

bb89f20 [CodingCat] update streaming docs
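
The corrected form per this fix: give the local master at least two threads so the receiver and the processing can both make progress:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
```
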
JIRA issue: [SPARK-1959](https://issues.apache.org/jira/browse/SPARK-1959)

Author: Cheng Lian <[email protected]>

Closes apache#909 from liancheng/spark-1959 and squashes the following commits:

306659c [Cheng Lian] [SPARK-1959] String "NULL" shouldn't be interpreted as null value
Author: Chen Chao <[email protected]>

Closes apache#928 from CrazyJvm/patch-8 and squashes the following commits:

144328b [Chen Chao] correct tiny comment error
…to prevent overflows the same as Sum.

The child of `SumDistinct` or `Average` should be widened to prevent overflows, the same as for `Sum`.

Author: Takuya UESHIN <[email protected]>

Closes apache#902 from ueshin/issues/SPARK-1947 and squashes the following commits:

99c3dcb [Takuya UESHIN] Insert Cast for SumDistinct and Average.
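
A hedged sketch of the widening (imports approximate for the Catalyst of this era): wrap an integral child in a `Cast` to a wider type before aggregating, mirroring `Sum`. Illustrative only; where the real change inserts the Cast may differ:

```scala
import org.apache.spark.sql.catalyst.expressions.{Average, Cast, Expression}
import org.apache.spark.sql.catalyst.types.{IntegerType, LongType}

// Widen an Int-typed Average child to Long so the running sum cannot overflow.
def widen(agg: Expression): Expression = agg match {
  case Average(child) if child.dataType == IntegerType => Average(Cast(child, LongType))
  case other => other
}
```
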
Due to the way spark-shell launches from an assembly jar, I don't think this change will affect anyone who isn't trying to launch the shell directly from sbt.  That said, it is kinda nice to be able to launch all things directly from SBT when developing.

Author: Michael Armbrust <[email protected]>

Closes apache#801 from marmbrus/hiveRepl and squashes the following commits:

9570571 [Michael Armbrust] Optionally include Hive as a dependency of the REPL.
Author: Michael Armbrust <[email protected]>

Closes apache#913 from marmbrus/timestampMetastore and squashes the following commits:

8e0154f [Michael Armbrust] Add timestamp to hive metastore type parser.
`Properties#load()` doesn't close the InputStream, but it'd be closed after being GC'd anyway...

Also changed file.getName to file, because getName only shows the filename. This will show the full (possibly relative) path, which is less confusing if it's not found.

Author: Aaron Davidson <[email protected]>

Closes apache#914 from aarondav/tiny and squashes the following commits:

db9d072 [Aaron Davidson] Super minor: Close inputStream in SparkSubmitArguments
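
A minimal sketch of the tidy-up (file name hypothetical): close the stream explicitly rather than leaving it to GC:

```scala
import java.io.FileInputStream
import java.util.Properties

val properties = new Properties()
val in = new FileInputStream("spark-defaults.conf")
try properties.load(in) finally in.close()
```
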
This patch simply ports over the Scala implementation of RDD#take(), which reads the first partition at the driver, then decides how many more partitions it needs to read and will possibly start a real job if it's more than 1. (Note that SparkContext#runJob(allowLocal=true) only runs the job locally if there's 1 partition selected and no parent stages.)

Author: Aaron Davidson <[email protected]>

Closes apache#922 from aarondav/take and squashes the following commits:

fa06df9 [Aaron Davidson] SPARK-1839: PySpark RDD#take() shouldn't always read from driver
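
A hedged, simplified sketch of the strategy being ported (shown on the Scala side, per the description): read partitions in growing waves until enough elements have been collected:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def take[T: ClassTag](rdd: RDD[T], num: Int): Array[T] = {
  val buf = new ArrayBuffer[T]
  var partsScanned = 0
  val totalParts = rdd.partitions.length
  while (buf.size < num && partsScanned < totalParts) {
    // First wave reads a single partition; later waves grow geometrically.
    val numPartsToTry = if (partsScanned == 0) 1 else partsScanned * 4
    val parts = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
    val left = num - buf.size
    val res = rdd.context.runJob(
      rdd, (it: Iterator[T]) => it.take(left).toArray, parts, allowLocal = true)
    res.foreach(buf ++= _)
    partsScanned += numPartsToTry
  }
  buf.take(num).toArray
}
```
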
Author: witgo <[email protected]>

Closes apache#786 from witgo/maven_plugin and squashes the following commits:

5de86a2 [witgo] Merge branch 'master' of https://github.com/apache/spark into maven_plugin
c35ef73 [witgo] Improve maven plugin configuration
https://issues.apache.org/jira/browse/SPARK-1917

Author: Uri Laserson <[email protected]>

Closes apache#866 from laserson/SPARK-1917 and squashes the following commits:

d947e8c [Uri Laserson] Added test for scipy.special importing
1798bbd [Uri Laserson] SPARK-1917: fix PySpark import of scipy.special
…to ...

...a JavaSparkContext and sqlCtx will refer to a JavaSQLContext

Author: Yadid Ayzenberg <[email protected]>

Closes apache#932 from yadid/master and squashes the following commits:

f92fb3a [Yadid Ayzenberg] updated java code blocks in spark SQL guide such that ctx will refer to a JavaSparkContext and sqlCtx will refer to a JavaSQLContext
The change set is actually pretty small -- mostly whitespace changes. Admittedly this is a scary change due to the lack of tests to cover the ec2 scripts, and also because indentation actually impacts control flow in Python ...

Look at changes without whitespace diff here: https://github.com/apache/spark/pull/891/files?w=1

Author: Reynold Xin <[email protected]>

Closes apache#891 from rxin/spark-ec2-pep8 and squashes the following commits:

ac1bf11 [Reynold Xin] Made spark_ec2.py PEP8 compliant.