
Commit 58f6e27

felixcheung authored and liancheng committed
[SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R
## What changes were proposed in this pull request?

Update doc as per discussion in PR #13592

## How was this patch tested?

manual

shivaram liancheng

Author: Felix Cheung <[email protected]>

Closes #13799 from felixcheung/rsqlprogrammingguide.
1 parent 0736753 commit 58f6e27

2 files changed: +17, -19 lines


docs/sparkr.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -152,7 +152,7 @@ write.df(people, path="people.parquet", source="parquet", mode="overwrite")
 
 ### From Hive tables
 
-You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sqlcontext). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`).
+You can also create SparkDataFrames from Hive tables. To do this we will need to create a SparkSession with Hive support which can access tables in the Hive MetaStore. Note that Spark should have been built with [Hive support](building-spark.html#building-with-hive-and-jdbc-support) and more details can be found in the [SQL programming guide](sql-programming-guide.html#starting-point-sparksession). In SparkR, by default it will attempt to create a SparkSession with Hive support enabled (`enableHiveSupport = TRUE`).
 
 <div data-lang="r" markdown="1">
 {% highlight r %}
```
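For readers of this change, a minimal SparkR sketch of the workflow the reworded paragraph describes. It assumes a Spark build with Hive support and an existing Hive table named `src` (the table name is borrowed from the Spark examples and is not part of this patch):

```r
library(SparkR)

# enableHiveSupport defaults to TRUE in SparkR
sparkR.session(enableHiveSupport = TRUE)

# Tables in the Hive MetaStore come back as SparkDataFrames
results <- sql("FROM src SELECT key, value")
head(results)
```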

docs/sql-programming-guide.md

Lines changed: 16 additions & 18 deletions
```diff
@@ -107,19 +107,17 @@ spark = SparkSession.build \
 
 <div data-lang="r" markdown="1">
 
-Unlike Scala, Java, and Python API, we haven't finished migrating `SQLContext` to `SparkSession` for SparkR yet, so
-the entry point into all relational functionality in SparkR is still the
-`SQLContext` class in Spark 2.0. To create a basic `SQLContext`, all you need is a `SparkContext`.
+The entry point into all functionality in Spark is the [`SparkSession`](api/R/sparkR.session.html) class. To initialize a basic `SparkSession`, just call `sparkR.session()`:
 
 {% highlight r %}
-spark <- sparkRSQL.init(sc)
+sparkR.session()
 {% endhighlight %}
 
-Note that when invoked for the first time, `sparkRSQL.init()` initializes a global `SQLContext` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SQLContext` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SQLContext` instance around.
+Note that when invoked for the first time, `sparkR.session()` initializes a global `SparkSession` singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the `SparkSession` once, then SparkR functions like `read.df` will be able to access this global instance implicitly, and users don't need to pass the `SparkSession` instance around.
 </div>
 </div>
 
-`SparkSession` (or `SQLContext` for SparkR) in Spark 2.0 provides builtin support for Hive features including the ability to
+`SparkSession` in Spark 2.0 provides builtin support for Hive features including the ability to
 write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
 To use these features, you do not need to have an existing Hive setup.
```
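A short sketch (not part of the diff) of the new entry point and its singleton behaviour, assuming SparkR is on the library path and the example JSON file bundled with the Spark distribution is available:

```r
library(SparkR)

# The first call creates the global SparkSession singleton
sparkR.session(appName = "example", master = "local[*]")

# Subsequent calls return the same session, so it never needs to be passed around
sparkR.session()

# SparkR functions such as read.df pick up the global session implicitly
df <- read.df("examples/src/main/resources/people.json", source = "json")
```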

```diff
@@ -175,15 +173,15 @@ df.show()
 </div>
 
 <div data-lang="r" markdown="1">
-With a `SQLContext`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
+With a `SparkSession`, applications can create DataFrames from a local R data.frame,
 from a Hive table, or from [Spark data sources](#data-sources).
 
 As an example, the following creates a DataFrame based on the content of a JSON file:
 
 {% highlight r %}
 df <- read.json("examples/src/main/resources/people.json")
 
-# Displays the content of the DataFrame to stdout
+# Displays the content of the DataFrame
 showDF(df)
 {% endhighlight %}
```
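To illustrate both creation paths the new wording mentions, a small sketch assuming an active `sparkR.session()` (the `faithful` dataset ships with base R; the JSON path comes from the Spark distribution):

```r
# From a local R data.frame
faithful_df <- as.DataFrame(faithful)
head(faithful_df)

# From a JSON data source
people <- read.json("examples/src/main/resources/people.json")
showDF(people)
```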

```diff
@@ -415,7 +413,7 @@ showDF(count(groupBy(df, "age")))
 
 For a complete list of the types of operations that can be performed on a DataFrame refer to the [API Documentation](api/R/index.html).
 
-In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/index.html).
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/R/SparkDataFrame.html).
 
 </div>
```
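A couple of the column functions the referenced page documents, sketched against the `df` used earlier in the guide (its `name` and `age` columns are carried over from the people.json example, an assumption rather than part of this patch):

```r
# String manipulation and simple column arithmetic
head(select(df, upper(df$name), df$age + 1))

# Math helpers compose the same way
head(select(df, round(df$age / 10) * 10))
```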

```diff
@@ -452,7 +450,7 @@ df = spark.sql("SELECT * FROM table")
 </div>
 
 <div data-lang="r" markdown="1">
-The `sql` function enables applications to run SQL queries programmatically and returns the result as a `DataFrame`.
+The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`.
 
 {% highlight r %}
 df <- sql("SELECT * FROM table")
```
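The hunk stops at the query itself; for context, a hedged sketch of how `table` typically comes into existence first (it mirrors the JSON example further down in the same guide):

```r
# A temporary view has to be registered before sql() can reference it
people <- read.json("examples/src/main/resources/people.json")
createOrReplaceTempView(people, "people")

teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)
```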
```diff
@@ -1159,11 +1157,10 @@ for teenName in teenNames.collect():
 <div data-lang="r" markdown="1">
 
 {% highlight r %}
-# spark from the previous example is used in this example.
 
-schemaPeople # The DataFrame from the previous example.
+schemaPeople # The SparkDataFrame from the previous example.
 
-# DataFrames can be saved as Parquet files, maintaining the schema information.
+# SparkDataFrame can be saved as Parquet files, maintaining the schema information.
 write.parquet(schemaPeople, "people.parquet")
 
 # Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
```
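A self-contained round-trip sketch of the snippet above, using a base R dataset instead of `schemaPeople` so it runs on its own (dataset and output path are illustrative):

```r
df <- as.DataFrame(mtcars)

# Parquet keeps the schema together with the data
write.parquet(df, "mtcars.parquet")

df2 <- read.parquet("mtcars.parquet")
printSchema(df2)
```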
```diff
@@ -1342,7 +1339,6 @@ df3.printSchema()
 <div data-lang="r" markdown="1">
 
 {% highlight r %}
-# spark from the previous example is used in this example.
 
 # Create a simple DataFrame, stored into a partition directory
 write.df(df1, "data/test_table/key=1", "parquet", "overwrite")
```
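The hunk is cut off after the first partition; a sketch of how the partitioned layout is usually completed and read back. The second data frame, the schema-merge option and the `read.df` call are based on the surrounding guide, not shown in this diff:

```r
# Two partitions of the same logical table, keyed by the directory name
df1 <- as.DataFrame(data.frame(single = 1:5, double = (1:5) * 2))
df2 <- as.DataFrame(data.frame(single = 6:10, triple = (6:10) * 3))
write.df(df1, "data/test_table/key=1", "parquet", "overwrite")
write.df(df2, "data/test_table/key=2", "parquet", "overwrite")

# Reading the parent directory discovers the partition column `key`
df3 <- read.df("data/test_table", source = "parquet", mergeSchema = "true")
printSchema(df3)
```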
```diff
@@ -1621,7 +1617,7 @@ anotherPeople = spark.jsonRDD(anotherPeopleRDD)
 
 <div data-lang="r" markdown="1">
 Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. using
-the `jsonFile` function, which loads data from a directory of JSON files where each line of the
+the `read.json()` function, which loads data from a directory of JSON files where each line of the
 files is a JSON object.
 
 Note that the file that is offered as _a json file_ is not a typical JSON file. Each
```
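To make the one-object-per-line format concrete, an illustrative sketch (the file contents mirror the bundled people.json; the temp file is only for demonstration):

```r
# Each line is a separate, self-contained JSON object
path <- tempfile(fileext = ".json")
writeLines(c('{"name":"Michael"}',
             '{"name":"Andy", "age":30}',
             '{"name":"Justin", "age":19}'), path)

people <- read.json(path)
printSchema(people)  # the name and age fields are inferred automatically
```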
```diff
@@ -1644,7 +1640,7 @@ printSchema(people)
 # Register this DataFrame as a table.
 createOrReplaceTempView(people, "people")
 
-# SQL statements can be run by using the sql methods provided by `spark`.
+# SQL statements can be run by using the sql methods.
 teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
 {% endhighlight %}
 </div>
```
```diff
@@ -1759,9 +1755,11 @@ results = spark.sql("FROM src SELECT key, value").collect()
 
 <div data-lang="r" markdown="1">
 
-When working with Hive one must construct a `HiveContext`, which inherits from `SparkSession`, and
+When working with Hive one must instantiate `SparkSession` with Hive support. This
 adds support for finding tables in the MetaStore and writing queries using HiveQL.
 {% highlight r %}
+# enableHiveSupport defaults to TRUE
+sparkR.session(enableHiveSupport = TRUE)
 sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
 sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
```

```diff
@@ -1947,7 +1945,7 @@ df = spark.read.format('jdbc').options(url='jdbc:postgresql:dbserver', dbtable='
 
 {% highlight r %}
 
-df <- loadDF(spark, source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
+df <- read.jdbc("jdbc:postgresql:dbserver", "schema.tablename", user = "username", password = "password")
 
 {% endhighlight %}
```
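Beyond the single-call form above, `read.jdbc` can also split the read into partitions; a sketch assuming a reachable PostgreSQL server, its JDBC driver on the Spark classpath, and a numeric `id` column (all assumptions, not part of this patch):

```r
# Partition the read on the numeric column `id`
df <- read.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
                partitionColumn = "id", lowerBound = 1, upperBound = 100000,
                numPartitions = 10,
                user = "username", password = "password")
head(df)
```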
