docs/sql-programming-guide.md — 28 changes: 14 additions & 14 deletions
@@ -819,16 +819,16 @@ saveDF(select(df, "name", "age"), "namesAndAges.parquet")

You can also manually specify the data source that will be used along with any extra options
that you would like to pass to the data source. Data sources are specified by their fully qualified
-name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can also use the shorted
-name (`json`, `parquet`, `jdbc`). DataFrames of any type can be converted into other types
+name (i.e., `org.apache.spark.sql.parquet`), but for built-in sources you can also use their short
+names (`json`, `parquet`, `jdbc`). DataFrames of any type can be converted into other types
using this syntax.

<div class="codetabs">
<div data-lang="scala" markdown="1">

{% highlight scala %}
val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")
-df.select("name", "age").write.format("json").save("namesAndAges.parquet")
+df.select("name", "age").write.format("json").save("namesAndAges.json")
{% endhighlight %}

</div>
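
As an illustrative sketch of the "extra options" mentioned above (the paths, JDBC URL, and table name are placeholders, and the exact option keys should be checked against the data source's documentation for your Spark version):

{% highlight scala %}
// Built-in short names ("json", "parquet", "jdbc") can stand in for fully qualified source names.
val people = sqlContext.read.format("parquet").load("examples/src/main/resources/users.parquet")

// Extra options are handed to the data source through option()/options();
// the url/dbtable keys below are the commonly used JDBC ones, with placeholder values.
val jdbcDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://localhost/test")
  .option("dbtable", "people")
  .load()

// A DataFrame loaded from one source can be written back out through another.
people.select("name").write.format("json").save("names.json")
{% endhighlight %}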
@@ -975,7 +975,7 @@ schemaPeople.write().parquet("people.parquet");
// The result of loading a parquet file is also a DataFrame.
DataFrame parquetFile = sqlContext.read().parquet("people.parquet");

-//Parquet files can also be registered as tables and then used in SQL statements.
+// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile");
DataFrame teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19");
List<String> teenagerNames = teenagers.javaRDD().map(new Function<Row, String>() {
@@ -1059,7 +1059,7 @@ SELECT * FROM parquetTable
Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in
the path of each partition directory. The Parquet data source is now able to discover and infer
-partitioning information automatically. For exmaple, we can store all our previously used
+partitioning information automatically. For example, we can store all our previously used
population data into a partitioned table using the following directory structure, with two extra
columns, `gender` and `country` as partitioning columns:
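
The layout follows the usual Hive convention of `column=value` directories (for instance, `path/to/table/gender=male/country=US/...`; the prefix here is a placeholder). As a brief, illustrative sketch, reading the table root back recovers `gender` and `country` as ordinary columns:

{% highlight scala %}
// Hypothetical layout: path/to/table/gender=male/country=US/data.parquet, and so on.
val people = sqlContext.read.parquet("path/to/table")

people.printSchema()
// Partitioning columns discovered from the directory names appear alongside the data columns, e.g.:
// root
//  |-- name: string (nullable = true)
//  |-- age: long (nullable = true)
//  |-- gender: string (nullable = true)
//  |-- country: string (nullable = true)
{% endhighlight %}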

@@ -1125,20 +1125,20 @@ source is now able to automatically detect this case and merge schemas of all th
import sqlContext.implicits._

// Create a simple DataFrame, stored into a partition directory
-val df1 = sparkContext.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
+val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")

// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
-val df2 = sparkContext.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
+val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")

// Read the partitioned table
val df3 = sqlContext.read.parquet("data/test_table")
df3.printSchema()

// The final schema consists of all 3 columns in the Parquet files together
-// with the partiioning column appeared in the partition directory paths.
+// with the partitioning column appeared in the partition directory paths.
// root
// |-- single: int (nullable = true)
// |-- double: int (nullable = true)
@@ -1169,7 +1169,7 @@ df3 = sqlContext.load("data/test_table", "parquet")
df3.printSchema()

# The final schema consists of all 3 columns in the Parquet files together
-# with the partiioning column appeared in the partition directory paths.
+# with the partitioning column appeared in the partition directory paths.
# root
# |-- single: int (nullable = true)
# |-- double: int (nullable = true)
@@ -1196,7 +1196,7 @@ df3 <- loadDF(sqlContext, "data/test_table", "parquet")
printSchema(df3)

# The final schema consists of all 3 columns in the Parquet files together
-# with the partiioning column appeared in the partition directory paths.
+# with the partitioning column appeared in the partition directory paths.
# root
# |-- single: int (nullable = true)
# |-- double: int (nullable = true)
@@ -1253,7 +1253,7 @@ Configuration of Parquet can be done using the `setConf` method on `SQLContext`
<td>false</td>
<td>
Turn on Parquet filter pushdown optimization. This feature is turned off by default because of a known
-bug in Paruet 1.6.0rc3 (<a href="https://issues.apache.org/jira/browse/PARQUET-136">PARQUET-136</a>).
+bug in Parquet 1.6.0rc3 (<a href="https://issues.apache.org/jira/browse/PARQUET-136">PARQUET-136</a>).
However, if your table doesn't contain any nullable string or binary columns, it's still safe to turn
this feature on.
</td>
@@ -1402,7 +1402,7 @@ sqlContext <- sparkRSQL.init(sc)
# The path can be either a single text file or a directory storing text files.
path <- "examples/src/main/resources/people.json"
# Create a DataFrame from the file(s) pointed to by path
-people <- jsonFile(sqlContex,t path)
+people <- jsonFile(sqlContext, path)

# The inferred schema can be visualized using the printSchema() method.
printSchema(people)
@@ -1474,7 +1474,7 @@ sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
adds support for finding tables in the MetaStore and writing queries using HiveQL. In addition to
-the `sql` method a `HiveContext` also provides an `hql` methods, which allows queries to be
+the `sql` method a `HiveContext` also provides an `hql` method, which allows queries to be
expressed in HiveQL.
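
A minimal Scala sketch along the lines of the Scala example earlier in this section (the table name and input path are the guide's usual placeholders; the sketch uses the `sql` method, and whether your Spark version still exposes the `hql` alias mentioned above is worth checking):

{% highlight scala %}
// sc is an existing SparkContext.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

// HiveQL statements, including DDL, are issued through the context.
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL and come back as DataFrames.
hiveContext.sql("FROM src SELECT key, value").collect().foreach(println)
{% endhighlight %}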

{% highlight java %}
@@ -2770,7 +2770,7 @@ from pyspark.sql.types import *
</tr>
<tr>
<td> <b>MapType</b> </td>
-<td> enviroment </td>
+<td> environment </td>
<td>
list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br />
<b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>.