
Commit e00980f

marmbrus authored and ahirreddy committed
First draft of python sql programming guide.
1 parent b0192d3 commit e00980f

File tree

1 file changed: +92 −4 lines

docs/sql-programming-guide.md

Lines changed: 92 additions & 4 deletions
@@ -20,7 +20,7 @@ a schema that describes the data types of each column in the row. A SchemaRDD i
in a traditional relational database. A SchemaRDD can be created from an existing RDD, parquet
file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

-**All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell.**
+**All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`.**

</div>

@@ -33,6 +33,19 @@ a schema that describes the data types of each column in the row. A JavaSchemaR
in a traditional relational database. A JavaSchemaRDD can be created from an existing RDD, parquet
file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).
</div>
+
+<div data-lang="python" markdown="1">
+
+Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using
+Spark. At the core of this component is a new type of RDD,
+[SchemaRDD](). SchemaRDDs are composed of
+[Row]() objects along with
+a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
+in a traditional relational database. A SchemaRDD can be created from an existing RDD, parquet
+file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).
+
+**All of the examples on this page use sample data included in the Spark distribution and can be run in the `pyspark` shell.**
+</div>
</div>

***************************************************************************************************
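For orientation, here is a minimal sketch of the three creation paths the new Python section lists, using the Python API this commit documents (the `pyspark.sql` import path and the shell-provided `sc` are assumptions; each path is covered in its own section below):

{% highlight python %}
# Sketch: the three ways the paragraph above says a SchemaRDD can be created.
from pyspark.sql import SQLContext   # assumed import path
sqlCtx = SQLContext(sc)              # `sc` is the SparkContext the pyspark shell provides

# 1. From an existing RDD, with the schema inferred from a dictionary per row.
rdd = sc.parallelize([{"name": "Alice", "age": 1}])
schemaRdd = sqlCtx.inferSchema(rdd)

# 2. From a parquet file written earlier with saveAsParquetFile.
# schemaRdd = sqlCtx.parquetFile("people.parquet")

# 3. By running HiveQL against data stored in Apache Hive (needs a HiveContext).
# schemaRdd = hiveCtx.hql("SELECT key, value FROM src")
{% endhighlight %}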
@@ -44,7 +57,7 @@ file, or by running HiveQL against data stored in [Apache Hive](http://hive.apac

The entry point into all relational functionality in Spark is the
[SQLContext](api/sql/core/index.html#org.apache.spark.sql.SQLContext) class, or one of its
-decendents. To create a basic SQLContext, all you need is a SparkContext.
+descendants. To create a basic SQLContext, all you need is a SparkContext.

{% highlight scala %}
val sc: SparkContext // An existing SparkContext.
@@ -60,7 +73,7 @@ import sqlContext._

The entry point into all relational functionality in Spark is the
[JavaSQLContext](api/sql/core/index.html#org.apache.spark.sql.api.java.JavaSQLContext) class, or one
-of its decendents. To create a basic JavaSQLContext, all you need is a JavaSparkContext.
+of its descendants. To create a basic JavaSQLContext, all you need is a JavaSparkContext.

{% highlight java %}
JavaSparkContext ctx = ...; // An existing JavaSparkContext.
@@ -69,6 +82,19 @@ JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(ctx);

</div>

+<div data-lang="python" markdown="1">
+
+The entry point into all relational functionality in Spark is the
+[SQLContext]() class, or one
+of its descendants. To create a basic SQLContext, all you need is a SparkContext.
+
+{% highlight python %}
+from pyspark.sql import SQLContext
+sqlCtx = SQLContext(sc)
+{% endhighlight %}
+
+</div>
+
</div>

## Running SQL on RDDs
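In the `pyspark` shell used by the Python snippet above, the SparkContext `sc` already exists; in a standalone script it has to be constructed first. A minimal sketch, assuming the same import paths (the master URL and application name are placeholders):

{% highlight python %}
# Standalone-script variant of the shell example: build the SparkContext
# yourself, then wrap it in a SQLContext.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "Spark SQL example")  # placeholder master and app name
sqlCtx = SQLContext(sc)
{% endhighlight %}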
@@ -81,7 +107,7 @@ One type of table that is supported by Spark SQL is an RDD of Scala case classes
defines the schema of the table. The names of the arguments to the case class are read using
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as Sequences or Arrays. This RDD can be implicitly converted to a SchemaRDD and then be
-registered as a table. Tables can used in subsequent SQL statements.
+registered as a table. Tables can be used in subsequent SQL statements.

{% highlight scala %}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
@@ -176,6 +202,27 @@ List<String> teenagerNames = teenagers.map(new Function<Row, String>() {

</div>

+<div data-lang="python" markdown="1">
+
+One type of table that is supported by Spark SQL is an RDD of dictionaries. The keys of the
+dictionary define the column names of the table, and the types are inferred by looking at the first
+row. Any RDD of dictionaries can be converted to a SchemaRDD and then registered as a table. Tables
+can be used in subsequent SQL statements.
+
+{% highlight python %}
+lines = sc.textFile("examples/src/main/resources/people.txt")
+parts = lines.map(lambda l: l.split(","))
+people = parts.map(lambda p: {"name": p[0], "age": int(p[1])})
+
+peopleTable = sqlCtx.inferSchema(people)
+peopleTable.registerAsTable("people")
+
+teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
+teenNames = teenagers.map(lambda p: "Name: " + p.name)
+{% endhighlight %}
+
+</div>
+
</div>

**Note that Spark SQL currently uses a very basic SQL parser.**
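To actually see the output of the Python example above, the transformed RDD has to be brought back to the driver; a minimal follow-up sketch using only the `teenNames` RDD built there:

{% highlight python %}
# Collect the query results to the driver and print them.
for name in teenNames.collect():
    print(name)
{% endhighlight %}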
@@ -231,6 +278,27 @@ parquetFile.registerAsTable("parquetFile");
JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19");


+{% endhighlight %}
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+{% highlight python %}
+
+peopleTable # The SchemaRDD from the previous example.
+
+# SchemaRDDs can be saved as parquet files, maintaining the schema information.
+peopleTable.saveAsParquetFile("people.parquet")
+
+# Read in the parquet file created above. Parquet files are self-describing so the schema is preserved.
+# The result of loading a parquet file is also a SchemaRDD.
+parquetFile = sqlCtx.parquetFile("people.parquet")
+
+# Parquet files can also be registered as tables and then used in SQL statements.
+parquetFile.registerAsTable("parquetFile")
+teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
+
{% endhighlight %}

</div>
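As a quick sanity check on the Python parquet round-trip above, the reloaded table should hold the same rows as the SchemaRDD it was written from; a small sketch using only names defined in that example:

{% highlight python %}
# The reloaded parquet file should contain the same number of rows
# as the SchemaRDD it was written from.
assert parquetFile.count() == peopleTable.count()

# The registered table can be queried like any other.
teenNames = teenagers.map(lambda p: "Name: " + p.name).collect()
{% endhighlight %}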
@@ -318,4 +386,24 @@ Row[] results = hiveCtx.hql("FROM src SELECT key, value").collect();

</div>

+<div data-lang="python" markdown="1">
+
+When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
+adds support for finding tables in the MetaStore and writing queries using HiveQL. In addition to
+the `sql` method a `HiveContext` also provides an `hql` method, which allows queries to be
+expressed in HiveQL.
+
+{% highlight python %}
+
+from pyspark.sql import HiveContext
+hiveCtx = HiveContext(sc)
+
+hiveCtx.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
+hiveCtx.hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
+
+# Queries are expressed in HiveQL.
+results = hiveCtx.hql("FROM src SELECT key, value").collect()
+
+{% endhighlight %}
+
</div>
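The `collect()` call in the Python Hive example returns the matched rows to the driver; a minimal sketch of consuming them, assuming field access by name works as in the earlier Python examples:

{% highlight python %}
# Each element of `results` is one row of the Hive query above.
for row in results:
    print("%s: %s" % (row.key, row.value))
{% endhighlight %}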
