Commit 922bc5b

HyukjinKwon authored and cmonkey committed
[SPARK-19134][EXAMPLE] Fix several sql, mllib and status api examples not working
## What changes were proposed in this pull request?

**binary_classification_metrics_example.py**

The LibSVM datasource loads `ml.linalg.SparseVector`, whereas the example requires `mllib.linalg.SparseVector`. The equivalent Scala example, `BinaryClassificationMetricsExample.scala`, seems fine.

```bash
./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py
```

```
  File ".../spark/examples/src/main/python/mllib/binary_classification_metrics_example.py", line 39, in <lambda>
    .rdd.map(lambda row: LabeledPoint(row[0], row[1]))
  File ".../spark/python/pyspark/mllib/regression.py", line 54, in __init__
    self.features = _convert_to_vector(features)
  File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 80, in _convert_to_vector
    raise TypeError("Cannot convert type %s into Vector" % type(l))
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
```

**status_api_demo.py** (this one does not work on Python 3.4.6)

The module is named `queue` in Python 3+.

```bash
PYSPARK_PYTHON=python3 ./bin/spark-submit examples/src/main/python/status_api_demo.py
```

```
Traceback (most recent call last):
  File ".../spark/examples/src/main/python/status_api_demo.py", line 22, in <module>
    import Queue
ImportError: No module named 'Queue'
```

**bisecting_k_means_example.py**

`BisectingKMeansModel` does not implement `save` and `load` in Python.

```bash
./bin/spark-submit examples/src/main/python/mllib/bisecting_k_means_example.py
```

```
Traceback (most recent call last):
  File ".../spark/examples/src/main/python/mllib/bisecting_k_means_example.py", line 46, in <module>
    model.save(sc, path)
AttributeError: 'BisectingKMeansModel' object has no attribute 'save'
```

**elementwise_product_example.py**

It calls `collect` on the vector.

```bash
./bin/spark-submit examples/src/main/python/mllib/elementwise_product_example.py
```

```
Traceback (most recent call last):
  File ".../spark/examples/src/main/python/mllib/elementwise_product_example.py", line 48, in <module>
    for each in transformedData2.collect():
  File ".../spark/python/pyspark/mllib/linalg/__init__.py", line 478, in __getattr__
    return getattr(self.array, item)
AttributeError: 'numpy.ndarray' object has no attribute 'collect'
```

**The following three examples throw an exception due to the relative path set in `spark.sql.warehouse.dir`.**

**hive.py**

```bash
./bin/spark-submit examples/src/main/python/sql/hive.py
```

```
Traceback (most recent call last):
  File ".../spark/examples/src/main/python/sql/hive.py", line 47, in <module>
    spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
  File ".../spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 541, in sql
  File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File ".../spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse);'
```

**SparkHiveExample.scala**

```bash
./bin/run-example sql.hive.SparkHiveExample
```

```
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:498)
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
```

**JavaSparkHiveExample.java**

```bash
./bin/run-example sql.hive.JavaSparkHiveExample
```

```
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: Unable to alter table.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./spark-warehouse
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:498)
	at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:484)
	at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1668)
```

## How was this patch tested?

Manually via

```bash
./bin/spark-submit examples/src/main/python/mllib/binary_classification_metrics_example.py
PYSPARK_PYTHON=python3 ./bin/spark-submit examples/src/main/python/status_api_demo.py
./bin/spark-submit examples/src/main/python/mllib/bisecting_k_means_example.py
./bin/spark-submit examples/src/main/python/mllib/elementwise_product_example.py
./bin/spark-submit examples/src/main/python/sql/hive.py
./bin/run-example sql.hive.JavaSparkHiveExample
./bin/run-example sql.hive.SparkHiveExample
```

These were found via

```bash
find ./examples/src/main/python -name "*.py" -exec spark-submit {} \;
```

Author: hyukjinkwon <[email protected]>

Closes apache#16515 from HyukjinKwon/minor-example-fix.
1 parent d50abdf commit 922bc5b

File tree

7 files changed: +18, -21 lines

examples/src/main/java/org/apache/spark/examples/sql/hive/JavaSparkHiveExample.java

Lines changed: 2 additions & 1 deletion

```diff
@@ -17,6 +17,7 @@
 package org.apache.spark.examples.sql.hive;
 
 // $example on:spark_hive$
+import java.io.File;
 import java.io.Serializable;
 import java.util.ArrayList;
 import java.util.List;
@@ -56,7 +57,7 @@ public void setValue(String value) {
   public static void main(String[] args) {
     // $example on:spark_hive$
     // warehouseLocation points to the default location for managed databases and tables
-    String warehouseLocation = "spark-warehouse";
+    String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
     SparkSession spark = SparkSession
       .builder()
       .appName("Java Spark Hive Example")
```

examples/src/main/python/mllib/binary_classification_metrics_example.py

Lines changed: 5 additions & 10 deletions

```diff
@@ -18,25 +18,20 @@
 Binary Classification Metrics Example.
 """
 from __future__ import print_function
-from pyspark.sql import SparkSession
+from pyspark import SparkContext
 # $example on$
 from pyspark.mllib.classification import LogisticRegressionWithLBFGS
 from pyspark.mllib.evaluation import BinaryClassificationMetrics
-from pyspark.mllib.regression import LabeledPoint
+from pyspark.mllib.util import MLUtils
 # $example off$
 
 if __name__ == "__main__":
-    spark = SparkSession\
-        .builder\
-        .appName("BinaryClassificationMetricsExample")\
-        .getOrCreate()
+    sc = SparkContext(appName="BinaryClassificationMetricsExample")
 
     # $example on$
     # Several of the methods available in scala are currently missing from pyspark
     # Load training data in LIBSVM format
-    data = spark\
-        .read.format("libsvm").load("data/mllib/sample_binary_classification_data.txt")\
-        .rdd.map(lambda row: LabeledPoint(row[0], row[1]))
+    data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
 
     # Split data into training (60%) and test (40%)
     training, test = data.randomSplit([0.6, 0.4], seed=11)
@@ -58,4 +53,4 @@
     print("Area under ROC = %s" % metrics.areaUnderROC)
     # $example off$
 
-    spark.stop()
+    sc.stop()
```
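The patch swaps the DataFrame-based LibSVM reader for `MLUtils.loadLibSVMFile`, which yields `mllib`-typed `LabeledPoint`s directly. As a hypothetical stdlib-only sketch (not pyspark code), the LIBSVM text format being loaded looks like this:

```python
# Hypothetical stdlib-only sketch (not pyspark) of the LIBSVM record format
# that MLUtils.loadLibSVMFile parses: "<label> <index>:<value> ...",
# where feature indices are 1-based in the file and 0-based once loaded.
def parse_libsvm_line(line):
    parts = line.split()
    label = float(parts[0])
    # Shift each 1-based file index down to a 0-based feature index.
    features = {int(i) - 1: float(v)
                for i, v in (p.split(":") for p in parts[1:])}
    return label, features

print(parse_libsvm_line("1.0 1:0.5 3:2.0"))  # → (1.0, {0: 0.5, 2: 2.0})
```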

examples/src/main/python/mllib/bisecting_k_means_example.py

Lines changed: 0 additions & 5 deletions

```diff
@@ -40,11 +40,6 @@
     # Evaluate clustering
     cost = model.computeCost(parsedData)
     print("Bisecting K-means Cost = " + str(cost))
-
-    # Save and load model
-    path = "target/org/apache/spark/PythonBisectingKMeansExample/BisectingKMeansModel"
-    model.save(sc, path)
-    sameModel = BisectingKMeansModel.load(sc, path)
     # $example off$
 
     sc.stop()
```
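Since the Python `BisectingKMeansModel` lacks `save`/`load`, the patch simply drops the persistence step. A defensive alternative (a sketch only, not part of the patch) would probe for the method before calling it:

```python
# Sketch of a defensive guard (not part of the patch): only persist a model
# if its class actually exposes a callable save() method, since some pyspark
# models (like BisectingKMeansModel at the time) did not implement persistence.
def save_if_supported(model, sc, path):
    if callable(getattr(model, "save", None)):
        model.save(sc, path)
        return True
    return False

class NoPersistModel:  # hypothetical stand-in for a model without save()
    pass

print(save_if_supported(NoPersistModel(), sc=None, path="/tmp/model"))  # → False
```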

examples/src/main/python/mllib/elementwise_product_example.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -45,7 +45,7 @@
         print(each)
 
     print("transformedData2:")
-    for each in transformedData2.collect():
+    for each in transformedData2:
         print(each)
 
     sc.stop()
```
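Here `transformedData2` is already a local vector, not an RDD, so `collect()` falls through the vector's attribute delegation to the wrapped numpy array and fails. A stdlib-only stand-in (hypothetical, not pyspark's actual class) shows the mechanism:

```python
# Hypothetical stand-in for pyspark's vector attribute delegation: unknown
# attributes are forwarded to the wrapped array, so an RDD-style collect()
# call surfaces as an AttributeError from the underlying container.
class VectorLike:
    def __init__(self, values):
        self.array = list(values)

    def __getattr__(self, item):
        # Mirrors the __getattr__ in the traceback: delegate to the array.
        return getattr(self.array, item)

    def __iter__(self):
        return iter(self.array)

v = VectorLike([1.6, 0.9, 2.4])
print(list(v))  # iterating the vector directly works
try:
    v.collect()  # lists (like numpy arrays) have no collect()
except AttributeError as e:
    print(type(e).__name__)  # → AttributeError
```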

examples/src/main/python/sql/hive.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -18,7 +18,7 @@
 from __future__ import print_function
 
 # $example on:spark_hive$
-from os.path import expanduser, join
+from os.path import expanduser, join, abspath
 
 from pyspark.sql import SparkSession
 from pyspark.sql import Row
@@ -34,7 +34,7 @@
 if __name__ == "__main__":
     # $example on:spark_hive$
     # warehouse_location points to the default location for managed databases and tables
-    warehouse_location = 'spark-warehouse'
+    warehouse_location = abspath('spark-warehouse')
 
     spark = SparkSession \
         .builder \
```
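The `abspath` change matters because Hive turns the warehouse setting into a URI; a relative value yields the malformed `file:./spark-warehouse` seen in the traceback. A minimal check of the resolution step:

```python
from os.path import abspath, isabs

# A relative warehouse dir such as 'spark-warehouse' becomes the malformed
# URI file:./spark-warehouse when Hive absolutizes it; resolving the path
# first produces an absolute path that forms a valid file: URI.
warehouse_location = abspath('spark-warehouse')
print(isabs(warehouse_location))  # → True
```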

examples/src/main/python/status_api_demo.py

Lines changed: 5 additions & 1 deletion

```diff
@@ -19,7 +19,11 @@
 
 import time
 import threading
-import Queue
+import sys
+if sys.version >= '3':
+    import queue as Queue
+else:
+    import Queue
 
 from pyspark import SparkConf, SparkContext
 
```
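The guarded import above is the standard Python 2/3 bridge for the renamed module: aliasing `queue` back to `Queue` lets the rest of the file use the Python 2 name unchanged on both interpreters. A standalone sketch of the same pattern:

```python
import sys

# Python 3 renamed the Queue module to queue; aliasing it back lets code
# written against the Python 2 name run unchanged on both interpreters.
if sys.version >= '3':
    import queue as Queue
else:
    import Queue

q = Queue.Queue()
q.put("done")
print(q.get())  # → done
```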

examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala

Lines changed: 3 additions & 1 deletion

```diff
@@ -17,6 +17,8 @@
 package org.apache.spark.examples.sql.hive
 
 // $example on:spark_hive$
+import java.io.File
+
 import org.apache.spark.sql.Row
 import org.apache.spark.sql.SparkSession
 // $example off:spark_hive$
@@ -38,7 +40,7 @@ object SparkHiveExample {
 
     // $example on:spark_hive$
     // warehouseLocation points to the default location for managed databases and tables
-    val warehouseLocation = "spark-warehouse"
+    val warehouseLocation = new File("spark-warehouse").getAbsolutePath
 
     val spark = SparkSession
       .builder()
```
