Commit 2e4075e

[SPARK-16557][SQL] Remove stale doc in sql/README.md
## What changes were proposed in this pull request?

Most of the documentation in https://github.com/apache/spark/blob/master/sql/README.md is stale. It would be useful to keep the list of projects to explain what's going on, and everything else should be removed.

## How was this patch tested?

N/A

Author: Reynold Xin <[email protected]>

Closes #14211 from rxin/SPARK-16557.
1 parent 972673a commit 2e4075e

File tree

1 file changed (+1, -74 lines)

sql/README.md

Lines changed: 1 addition & 74 deletions
@@ -1,83 +1,10 @@
 Spark SQL
 =========
 
-This module provides support for executing relational queries expressed in either SQL or a LINQ-like Scala DSL.
+This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.
 
 Spark SQL is broken up into four subprojects:
 - Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
 - Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
 - Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
 - HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
-
-
-Other dependencies for developers
----------------------------------
-In order to create new hive test cases (i.e. a test suite based on `HiveComparisonTest`),
-you will need to setup your development environment based on the following instructions.
-
-If you are working with Hive 0.12.0, you will need to set several environmental variables as follows.
-
-```
-export HIVE_HOME="<path to>/hive/build/dist"
-export HIVE_DEV_HOME="<path to>/hive/"
-export HADOOP_HOME="<path to>/hadoop"
-```
-
-If you are working with Hive 0.13.1, the following steps are needed:
-
-1. Download Hive's [0.13.1](https://archive.apache.org/dist/hive/hive-0.13.1) and set `HIVE_HOME` with `export HIVE_HOME="<path to hive>"`. Please do not set `HIVE_DEV_HOME` (See [SPARK-4119](https://issues.apache.org/jira/browse/SPARK-4119)).
-2. Set `HADOOP_HOME` with `export HADOOP_HOME="<path to hadoop>"`
-3. Download all Hive 0.13.1a jars (Hive jars actually used by Spark) from [here](http://mvnrepository.com/artifact/org.spark-project.hive) and replace corresponding original 0.13.1 jars in `$HIVE_HOME/lib`.
-4. Download [Kryo 2.21 jar](http://mvnrepository.com/artifact/com.esotericsoftware.kryo/kryo/2.21) (Note: 2.22 jar does not work) and [Javolution 5.5.1 jar](http://mvnrepository.com/artifact/javolution/javolution/5.5.1) to `$HIVE_HOME/lib`.
-5. This step is optional. But, when generating golden answer files, if a Hive query fails and you find that Hive tries to talk to HDFS or you find weird runtime NPEs, set the following in your test suite...
-
-```
-val testTempDir = Utils.createTempDir()
-// We have to use kryo to let Hive correctly serialize some plans.
-sql("set hive.plan.serialization.format=kryo")
-// Explicitly set fs to local fs.
-sql(s"set fs.default.name=file://$testTempDir/")
-// Ask Hive to run jobs in-process as a single map and reduce task.
-sql("set mapred.job.tracker=local")
-```
-
-Using the console
-=================
-An interactive scala console can be invoked by running `build/sbt hive/console`.
-From here you can execute queries with HiveQl and manipulate DataFrame by using DSL.
-
-```scala
-$ build/sbt hive/console
-
-[info] Starting scala interpreter...
-import org.apache.spark.sql.catalyst.analysis._
-import org.apache.spark.sql.catalyst.dsl._
-import org.apache.spark.sql.catalyst.errors._
-import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.plans.logical._
-import org.apache.spark.sql.catalyst.rules._
-import org.apache.spark.sql.catalyst.util._
-import org.apache.spark.sql.execution
-import org.apache.spark.sql.functions._
-import org.apache.spark.sql.hive._
-import org.apache.spark.sql.hive.test.TestHive._
-import org.apache.spark.sql.hive.test.TestHive.implicits._
-import org.apache.spark.sql.types._
-Type in expressions to have them evaluated.
-Type :help for more information.
-
-scala> val query = sql("SELECT * FROM (SELECT * FROM src) a")
-query: org.apache.spark.sql.DataFrame = [key: int, value: string]
-```
-
-Query results are `DataFrames` and can be operated as such.
-```
-scala> query.collect()
-res0: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86], [311,val_311], [27,val_27]...
-```
-
-You can also build further queries on top of these `DataFrames` using the query DSL.
-```
-scala> query.where(query("key") > 30).select(avg(query("key"))).collect()
-res1: Array[org.apache.spark.sql.Row] = Array([274.79025423728814])
-```
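The one line that survives the cleanup points readers at the DataFrame/Dataset API rather than the old LINQ-like Scala DSL. For orientation only (this sketch is not part of the commit or of sql/README.md), here is a minimal Scala example of the kind of relational query that API expresses, assuming a Spark 2.x `SparkSession`, a local master, and a hypothetical `people.json` file with an `age` column:

```scala
// Minimal sketch of the DataFrame/Dataset API referenced by the updated README.
// Assumes Spark 2.x on the classpath; the input path "people.json" is hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object DataFrameApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameApiSketch")
      .master("local[*]")          // run locally for the sketch
      .getOrCreate()

    // Load a DataFrame from a JSON file (hypothetical path and schema).
    val people = spark.read.json("people.json")

    // The same relational query expressed two ways:
    // 1) SQL text, via a temporary view.
    people.createOrReplaceTempView("people")
    val viaSql = spark.sql("SELECT avg(age) FROM people WHERE age > 30")

    // 2) The DataFrame API.
    val viaApi = people.where(people("age") > 30).select(avg(people("age")))

    viaSql.show()
    viaApi.show()

    spark.stop()
  }
}
```

Both forms are planned by Catalyst and executed by sql/core, the subprojects the retained list in the README describes.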