## What changes were proposed in this pull request?
This PR documents the scalable partition handling feature in the body of the programming guide.

Before this PR, we only mentioned it in the migration guide. It was not clear that, since 2.1, external datasource tables require an extra `MSCK REPAIR TABLE` command to have their per-partition information persisted in the metastore.
## How was this patch tested?
N/A.
Author: Cheng Lian <[email protected]>
Closes apache#16424 from liancheng/scalable-partition-handling-doc.
**`docs/sql-programming-guide.md`** (11 additions, 4 deletions)
```diff
@@ -515,7 +515,7 @@ new data.
 ### Saving to Persistent Tables
 
 `DataFrames` can also be saved as persistent tables into Hive metastore using the `saveAsTable`
-command. Notice existing Hive deployment is not necessary to use this feature. Spark will create a
+command. Notice that an existing Hive deployment is not necessary to use this feature. Spark will create a
 default local Hive metastore (using Derby) for you. Unlike the `createOrReplaceTempView` command,
 `saveAsTable` will materialize the contents of the DataFrame and create a pointer to the data in the
 Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as
@@ -526,11 +526,18 @@ By default `saveAsTable` will create a "managed table", meaning that the locatio
 be controlled by the metastore. Managed tables will also have their data deleted automatically
 when a table is dropped.
 
-Currently, `saveAsTable` does not expose an API supporting the creation of an "External table" from a `DataFrame`,
-however, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
-and location of the external table as its value (String) when saving the table with `saveAsTable`. When an External table
+Currently, `saveAsTable` does not expose an API supporting the creation of an "external table" from a `DataFrame`.
+However, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
+and location of the external table as its value (a string) when saving the table with `saveAsTable`. When an External table
 is dropped only its metadata is removed.
 
+Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:
+
+- Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.
+- Hive DDLs such as `ALTER TABLE PARTITION ... SET LOCATION` are now available for tables created with the Datasource API.
+
+Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
+
 ## Parquet Files
 
 [Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
```
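For context, here is a minimal Scala sketch of the workflow the new paragraphs describe: creating an external datasource table (one with a `path` option) over existing data, then syncing its partition information with `MSCK REPAIR TABLE`. The `events` table, its schema, and the `/tmp/events` location are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-handling-example")
  .enableHiveSupport() // table metadata is kept in the Hive metastore
  .getOrCreate()

import spark.implicits._

// Write a small partitioned dataset to a hypothetical location.
Seq(
  (1, "click", "2016-12-01"),
  (2, "view",  "2016-12-02")
).toDF("id", "kind", "dt")
  .write
  .partitionBy("dt")
  .parquet("/tmp/events")

// Create an external datasource table over the existing data by
// supplying a `path` option.
spark.sql("""
  CREATE TABLE events (id INT, kind STRING, dt STRING)
  USING parquet
  OPTIONS (path '/tmp/events')
  PARTITIONED BY (dt)
""")

// Since 2.1, partition information is not gathered when an external
// datasource table is created, so sync it into the metastore first:
spark.sql("MSCK REPAIR TABLE events")

// Partition-level Hive DDLs such as
//   ALTER TABLE events PARTITION (dt = '2016-12-01') SET LOCATION '...'
// are also available for datasource tables as of 2.1.
```

Without the `MSCK REPAIR TABLE` step, the metastore would hold no partition entries for `events`, so queries against it would see no data.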