
Commit 26bc765

hcook authored and marmbrus committed
[SQL] Minor edits to sql programming guide.
Author: Henry Cook <[email protected]>

Closes apache#2316 from hcook/sql-docs and squashes the following commits:

373f94b [Henry Cook] Minor edits to sql programming guide.
1 parent 386bc24 commit 26bc765

File tree

1 file changed

+47 -45 lines changed


docs/sql-programming-guide.md

Lines changed: 47 additions & 45 deletions
@@ -13,10 +13,10 @@ title: Spark SQL Programming Guide
Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using
Spark. At the core of this component is a new type of RDD,
-[SchemaRDD](api/scala/index.html#org.apache.spark.sql.SchemaRDD). SchemaRDDs are composed
-[Row](api/scala/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects along with
+[SchemaRDD](api/scala/index.html#org.apache.spark.sql.SchemaRDD). SchemaRDDs are composed of
+[Row](api/scala/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects, along with
a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
-in a traditional relational database. A SchemaRDD can be created from an existing RDD, [Parquet](http://parquet.io)
+in a traditional relational database. A SchemaRDD can be created from an existing RDD, a [Parquet](http://parquet.io)
file, a JSON dataset, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`.
@@ -26,21 +26,21 @@ All of the examples on this page use sample data included in the Spark distribut
<div data-lang="java" markdown="1">
Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using
Spark. At the core of this component is a new type of RDD,
-[JavaSchemaRDD](api/scala/index.html#org.apache.spark.sql.api.java.JavaSchemaRDD). JavaSchemaRDDs are composed
-[Row](api/scala/index.html#org.apache.spark.sql.api.java.Row) objects along with
+[JavaSchemaRDD](api/scala/index.html#org.apache.spark.sql.api.java.JavaSchemaRDD). JavaSchemaRDDs are composed of
+[Row](api/scala/index.html#org.apache.spark.sql.api.java.Row) objects, along with
a schema that describes the data types of each column in the row. A JavaSchemaRDD is similar to a table
-in a traditional relational database. A JavaSchemaRDD can be created from an existing RDD, [Parquet](http://parquet.io)
+in a traditional relational database. A JavaSchemaRDD can be created from an existing RDD, a [Parquet](http://parquet.io)
file, a JSON dataset, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).
</div>

<div data-lang="python" markdown="1">

Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using
Spark. At the core of this component is a new type of RDD,
-[SchemaRDD](api/python/pyspark.sql.SchemaRDD-class.html). SchemaRDDs are composed
-[Row](api/python/pyspark.sql.Row-class.html) objects along with
+[SchemaRDD](api/python/pyspark.sql.SchemaRDD-class.html). SchemaRDDs are composed of
+[Row](api/python/pyspark.sql.Row-class.html) objects, along with
a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
-in a traditional relational database. A SchemaRDD can be created from an existing RDD, [Parquet](http://parquet.io)
+in a traditional relational database. A SchemaRDD can be created from an existing RDD, a [Parquet](http://parquet.io)
file, a JSON dataset, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).

All of the examples on this page use sample data included in the Spark distribution and can be run in the `pyspark` shell.
@@ -68,11 +68,11 @@ val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
{% endhighlight %}

-In addition to the basic SQLContext, you can also create a HiveContext, which provides a strict
-super set of the functionality provided by the basic SQLContext. Additional features include
+In addition to the basic SQLContext, you can also create a HiveContext, which provides a
+superset of the functionality provided by the basic SQLContext. Additional features include
the ability to write queries using the more complete HiveQL parser, access to HiveUDFs, and the
ability to read data from Hive tables. To use a HiveContext, you do not need to have an
-existing hive setup, and all of the data sources available to a SQLContext are still available.
+existing Hive setup, and all of the data sources available to a SQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default
Spark build. If these dependencies are not a problem for your application then using HiveContext
is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to
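
As a rough sketch of the two entry points described above (assuming the Spark 1.1-era Scala API and the `sc` SparkContext that `spark-shell` provides):

{% highlight scala %}
// sc is the existing SparkContext (created automatically by spark-shell).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// HiveContext provides a superset of SQLContext's functionality and
// does not require an existing Hive installation.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
{% endhighlight %}
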
@@ -95,7 +95,7 @@ In addition to the basic SQLContext, you can also create a HiveContext, which pr
super set of the functionality provided by the basic SQLContext. Additional features include
the ability to write queries using the more complete HiveQL parser, access to HiveUDFs, and the
ability to read data from Hive tables. To use a HiveContext, you do not need to have an
-existing hive setup, and all of the data sources available to a SQLContext are still available.
+existing Hive setup, and all of the data sources available to a SQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default
Spark build. If these dependencies are not a problem for your application then using HiveContext
is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to
@@ -118,7 +118,7 @@ In addition to the basic SQLContext, you can also create a HiveContext, which pr
super set of the functionality provided by the basic SQLContext. Additional features include
the ability to write queries using the more complete HiveQL parser, access to HiveUDFs, and the
ability to read data from Hive tables. To use a HiveContext, you do not need to have an
-existing hive setup, and all of the data sources available to a SQLContext are still available.
+existing Hive setup, and all of the data sources available to a SQLContext are still available.
HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default
Spark build. If these dependencies are not a problem for your application then using HiveContext
is recommended for the 1.2 release of Spark. Future releases will focus on bringing SQLContext up to
@@ -146,11 +146,11 @@ describes the various methods for loading data into a SchemaRDD.

Spark SQL supports two different methods for converting existing RDDs into SchemaRDDs. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects. This
-reflection based approach leads to more concise code and works well went the schema is known ahead
-of time, while you are writing your Spark application.
+reflection based approach leads to more concise code and works well when you already know the schema
+while writing your Spark application.

The second method for creating SchemaRDDs is through a programmatic interface that allows you to
-construct a schema and then apply it to and existing RDD. While this method is more verbose, it allows
+construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows
you to construct SchemaRDDs when the columns and their types are not known until runtime.

### Inferring the Schema Using Reflection
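
To make the reflection-based approach concrete, a minimal Scala sketch in the spirit of the guide's full example (assumes `spark-shell`, the sample `people.txt` file from the Spark distribution, and the 1.1-era SchemaRDD API):

{% highlight scala %}
// A case class defines the schema; Spark SQL infers column names and types from its fields.
case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicitly converts an RDD[Person] into a SchemaRDD

val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
{% endhighlight %}
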
@@ -266,10 +266,10 @@ List<String> teenagerNames = teenagers.map(new Function<Row, String>() {

<div data-lang="python" markdown="1">

-Spark SQL can convert an RDD of Row objects to a SchemaRDD, inferring the datatypes . Rows are constructed by passing a list of
-key/value pairs as kwargs to the Row class. The keys of this list define the columns names of the table,
+Spark SQL can convert an RDD of Row objects to a SchemaRDD, inferring the datatypes. Rows are constructed by passing a list of
+key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
and the types are inferred by looking at the first row. Since we currently only look at the first
-row, it is important that there is no missing data in the first row of the RDD. In future version we
+row, it is important that there is no missing data in the first row of the RDD. In future versions we
plan to more completely infer the schema by looking at more data, similar to the inference that is
performed on JSON files.

@@ -306,14 +306,14 @@ for teenName in teenNames.collect():

<div data-lang="scala" markdown="1">

-In cases that case classes cannot be defined ahead of time (for example,
-the structure of records is encoded in a string or a text dataset will be parsed
+When case classes cannot be defined ahead of time (for example,
+the structure of records is encoded in a string, or a text dataset will be parsed
and fields will be projected differently for different users),
a `SchemaRDD` can be created programmatically with three steps.

1. Create an RDD of `Row`s from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
-`Row`s in the RDD created in the step 1.
+`Row`s in the RDD created in Step 1.
3. Apply the schema to the RDD of `Row`s via `applySchema` method provided
by `SQLContext`.

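The three steps above can be sketched in Scala roughly as follows (the `"name age"` schema string and file path are illustrative only; assumes the 1.1-era `applySchema` API):

{% highlight scala %}
import org.apache.spark.sql._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// The column names are only known at runtime, encoded in a string.
val schemaString = "name age"

// Step 2: build a StructType matching the Rows created below.
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Step 1: convert the raw records into an RDD of Rows.
val people = sc.textFile("examples/src/main/resources/people.txt")
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Step 3: apply the schema and register the result as a table.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")

sqlContext.sql("SELECT name FROM people").map(t => "Name: " + t(0)).collect().foreach(println)
{% endhighlight %}
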
@@ -358,14 +358,14 @@ results.map(t => "Name: " + t(0)).collect().foreach(println)

<div data-lang="java" markdown="1">

-In cases that JavaBean classes cannot be defined ahead of time (for example,
-the structure of records is encoded in a string or a text dataset will be parsed and
+When JavaBean classes cannot be defined ahead of time (for example,
+the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a `SchemaRDD` can be created programmatically with three steps.

1. Create an RDD of `Row`s from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
-`Row`s in the RDD created in the step 1.
+`Row`s in the RDD created in Step 1.
3. Apply the schema to the RDD of `Row`s via `applySchema` method provided
by `JavaSQLContext`.

@@ -427,10 +427,10 @@ List<String> names = results.map(new Function<Row, String>() {

<div data-lang="python" markdown="1">

-For some cases (for example, the structure of records is encoded in a string or
-a text dataset will be parsed and fields will be projected differently for
-different users), it is desired to create `SchemaRDD` with a programmatically way.
-It can be done with three steps.
+When a dictionary of kwargs cannot be defined ahead of time (for example,
+the structure of records is encoded in a string, or a text dataset will be parsed and
+fields will be projected differently for different users),
+a `SchemaRDD` can be created programmatically with three steps.

1. Create an RDD of tuples or lists from the original RDD;
2. Create the schema represented by a `StructType` matching the structure of
@@ -566,7 +566,7 @@ for teenName in teenNames.collect():

### Configuration

-Configuration of parquet can be done using the `setConf` method on SQLContext or by running
+Configuration of Parquet can be done using the `setConf` method on SQLContext or by running
`SET key=value` commands using SQL.

<table class="table">
@@ -575,23 +575,23 @@ Configuration of parquet can be done using the `setConf` method on SQLContext or
<td><code>spark.sql.parquet.binaryAsString</code></td>
<td>false</td>
<td>
-Some other parquet producing systems, in particular Impala and older versions of Spark SQL, do
-not differentiate between binary data and strings when writing out the parquet schema. This
+Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do
+not differentiate between binary data and strings when writing out the Parquet schema. This
flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
</td>
</tr>
<tr>
<td><code>spark.sql.parquet.cacheMetadata</code></td>
<td>false</td>
<td>
-Turns on caching of parquet schema metadata. Can speed up querying
+Turns on caching of Parquet schema metadata. Can speed up querying of static data.
</td>
</tr>
<tr>
<td><code>spark.sql.parquet.compression.codec</code></td>
<td>snappy</td>
<td>
-Sets the compression codec use when writing parquet files. Acceptable values include:
+Sets the compression codec use when writing Parquet files. Acceptable values include:
uncompressed, snappy, gzip, lzo.
</td>
</tr>
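
Both configuration routes mentioned above can be sketched as follows (illustrative values only; `sqlContext` is the context created earlier):

{% highlight scala %}
// Programmatically, via setConf on the SQLContext...
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")

// ...or with a SQL SET command.
sqlContext.sql("SET spark.sql.parquet.cacheMetadata=true")
{% endhighlight %}
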
@@ -805,9 +805,8 @@ Spark SQL can cache tables using an in-memory columnar format by calling `cacheT
Then Spark SQL will scan only required columns and will automatically tune compression to minimize
memory usage and GC pressure. You can call `uncacheTable("tableName")` to remove the table from memory.

-Note that if you just call `cache` rather than `cacheTable`, tables will _not_ be cached in
-in-memory columnar format. So we strongly recommend using `cacheTable` whenever you want to
-cache tables.
+Note that if you call `cache` rather than `cacheTable`, tables will _not_ be cached using
+the in-memory columnar format, and therefore `cacheTable` is strongly recommended for this use case.

Configuration of in-memory caching can be done using the `setConf` method on SQLContext or by running
`SET key=value` commands using SQL.
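
A small sketch of the recommended pattern (assumes a table named `people` registered as in the earlier examples):

{% highlight scala %}
// Cache the table using the in-memory columnar format...
sqlContext.cacheTable("people")

// ...run queries against the cached data...
sqlContext.sql("SELECT COUNT(*) FROM people").collect()

// ...and release the memory when finished.
sqlContext.uncacheTable("people")
{% endhighlight %}
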
@@ -833,7 +832,7 @@ Configuration of in-memory caching can be done using the `setConf` method on SQL

</table>

-## Other Configuration
+## Other Configuration Options

The following options can also be used to tune the performance of query execution. It is possible
that these options will be deprecated in future release as more optimizations are performed automatically.
@@ -842,7 +841,7 @@ that these options will be deprecated in future release as more optimizations ar
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.sql.autoBroadcastJoinThreshold</code></td>
-<td>false</td>
+<td>10000</td>
<td>
Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when
performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently
@@ -876,7 +875,7 @@ code.
## Running the Thrift JDBC server

The Thrift JDBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
-in Hive 0.12. You can test the JDBC server with the beeline script comes with either Spark or Hive 0.12.
+in Hive 0.12. You can test the JDBC server with the beeline script that comes with either Spark or Hive 0.12.

To start the JDBC server, run the following in the Spark directory:

@@ -899,12 +898,12 @@ your machine and a blank password. For secure mode, please follow the instructio

Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.

-You may also use the beeline script comes with Hive.
+You may also use the beeline script that comes with Hive.

## Running the Spark SQL CLI

The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute
-queries input from command line. Note: the Spark SQL CLI cannot talk to the Thrift JDBC server.
+queries input from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.

To start the Spark SQL CLI, run the following in the Spark directory:

@@ -916,7 +915,10 @@ options.

# Compatibility with Other Systems

-## Migration Guide for Shark Users
+## Migration Guide for Shark User
+
+### Scheduling
+s
To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a JDBC client session,
users can set the `spark.sql.thriftserver.scheduler.pool` variable:

@@ -925,7 +927,7 @@ users can set the `spark.sql.thriftserver.scheduler.pool` variable:
### Reducer number

In Shark, default reducer number is 1 and is controlled by the property `mapred.reduce.tasks`. Spark
-SQL deprecates this property by a new property `spark.sql.shuffle.partitions`, whose default value
+SQL deprecates this property in favor of `spark.sql.shuffle.partitions`, whose default value
is 200. Users may customize this property via `SET`:

SET spark.sql.shuffle.partitions=10;
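
The same setting can also be applied programmatically; a sketch for a Scala `SQLContext` (the value 10 is illustrative only):

{% highlight scala %}
// Equivalent to the SQL SET command above.
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
{% endhighlight %}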
