Spark SQL
=========

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:
 - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
 - Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or DataFrame queries against existing RDDs and Parquet files (a brief usage sketch follows this list).
 - Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
 - HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
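
As a rough illustration of the sql/core entry point described above, the following is a minimal sketch (not code from this module) of using SQLContext against an existing RDD. It assumes the Spark 1.x-era API (`toDF`, `registerTempTable`, `read.parquet`); the `Record` case class and object name are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type used only for this example.
case class Record(key: Int, value: String)

object SqlCoreExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-core-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Turn an existing RDD of case classes into a DataFrame and register it as a temporary table.
    val rdd = sc.parallelize(Seq(Record(1, "a"), Record(2, "b"), Record(42, "c")))
    rdd.toDF().registerTempTable("records")

    // Run a SQL query against the RDD-backed table.
    sqlContext.sql("SELECT key, value FROM records WHERE key > 1").show()

    // Parquet files can be queried the same way, e.g.:
    // sqlContext.read.parquet("/path/to/data.parquet").registerTempTable("parquet_records")

    sc.stop()
  }
}
```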


Other dependencies for developers
---------------------------------
In order to create new Hive test cases (i.e. a test suite based on `HiveComparisonTest`),
you will need to set up your development environment based on the following instructions
(a minimal example suite is sketched at the end of this section).

If you are working with Hive 0.12.0, you will need to set several environment variables as follows.

```
export HIVE_HOME="<path to>/hive/build/dist"
export HIVE_DEV_HOME="<path to>/hive/"
export HADOOP_HOME="<path to>/hadoop"
```

If you are working with Hive 0.13.1, the following steps are needed:

1. Download the Hive [0.13.1 release](https://archive.apache.org/dist/hive/hive-0.13.1) and set `HIVE_HOME` with `export HIVE_HOME="<path to hive>"`. Please do not set `HIVE_DEV_HOME` (see [SPARK-4119](https://issues.apache.org/jira/browse/SPARK-4119)).
2. Set `HADOOP_HOME` with `export HADOOP_HOME="<path to hadoop>"`.
3. Download all Hive 0.13.1a jars (the Hive jars actually used by Spark) from [here](http://mvnrepository.com/artifact/org.spark-project.hive) and replace the corresponding original 0.13.1 jars in `$HIVE_HOME/lib`.
4. Download the [Kryo 2.21 jar](http://mvnrepository.com/artifact/com.esotericsoftware.kryo/kryo/2.21) (note: the 2.22 jar does not work) and the [Javolution 5.5.1 jar](http://mvnrepository.com/artifact/javolution/javolution/5.5.1) to `$HIVE_HOME/lib`.
5. This step is optional, but when generating golden answer files, if a Hive query fails and you find that Hive tries to talk to HDFS or you see unexpected runtime NPEs, set the following in your test suite:

```
import org.apache.spark.util.Utils

val testTempDir = Utils.createTempDir()
// We have to use kryo to let Hive correctly serialize some plans.
sql("set hive.plan.serialization.format=kryo")
// Explicitly set fs to local fs.
sql(s"set fs.default.name=file://$testTempDir/")
// Ask Hive to run jobs in-process as a single map and reduce task.
sql("set mapred.job.tracker=local")
```
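
For reference, a new golden-answer test case added this way typically looks something like the following minimal sketch. This is not code from the repository; it assumes the `createQueryTest` helper provided by `HiveComparisonTest`, and the suite and query names are made up for illustration:

```scala
package org.apache.spark.sql.hive.execution

// Minimal sketch of a golden-answer suite, assuming HiveComparisonTest's
// createQueryTest helper, which runs the given HiveQL through both Spark SQL
// and Hive and compares the answers against a generated golden file.
class ExampleHiveQuerySuite extends HiveComparisonTest {
  // The test name also names the golden answer file.
  createQueryTest("example simple select",
    "SELECT key, value FROM src WHERE key < 10")
}
```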

Using the console
=================
An interactive Scala console can be invoked by running `build/sbt hive/console`.
From here you can execute queries in HiveQL and manipulate DataFrames using the DSL.

```scala
$ build/sbt hive/console

[info] Starting scala interpreter...
import org.apache.spark.sql.catalyst.analysis._
import org.apache.spark.sql.catalyst.dsl._
import org.apache.spark.sql.catalyst.errors._
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules._
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.execution
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive._
import org.apache.spark.sql.hive.test.TestHive._
import org.apache.spark.sql.hive.test.TestHive.implicits._
import org.apache.spark.sql.types._
Type in expressions to have them evaluated.
Type :help for more information.

scala> val query = sql("SELECT * FROM (SELECT * FROM src) a")
query: org.apache.spark.sql.DataFrame = [key: int, value: string]
```

Query results are `DataFrames` and can be operated on as such.
```
scala> query.collect()
res0: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86], [311,val_311], [27,val_27]...
```

You can also build further queries on top of these `DataFrames` using the query DSL.
```
scala> query.where(query("key") > 30).select(avg(query("key"))).collect()
res1: Array[org.apache.spark.sql.Row] = Array([274.79025423728814])
```
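
For comparison, the same aggregation can also be written directly in SQL from the console; a rough equivalent of the DSL query above (assuming the same `src` test table) is:

```
scala> sql("SELECT AVG(key) FROM src WHERE key > 30").collect()
```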