
Commit f975d22

Updated docs for Hive compatibility and Shark migration guide draft
1 parent 3ad4e75 commit f975d22

File tree

1 file changed (+129, -1 lines)


docs/sql-programming-guide.md

Lines changed: 129 additions & 1 deletion
@@ -136,7 +136,7 @@ val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
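// A minimal sketch of the Product-interface workaround mentioned above (illustrative names only;
// a real use of this pattern would declare more than 22 fields):
class WideRecord(val id: Int, val name: String, val score: Double)
  extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
  def productArity: Int = 3
  def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => score
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}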

@@ -601,6 +601,134 @@ Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.

You may also use the beeline script that comes with Hive.

### Migration Guide for Shark Users

#### Reducer number

In Shark, the default reducer number is 1 and can be tuned by the property `mapred.reduce.tasks`. In Spark SQL, the reducer number defaults to 200 and can be customized by the `spark.sql.shuffle.partitions` property:

```
SET spark.sql.shuffle.partitions=10;
SELECT page, count(*) c FROM logs_last_month_cached
GROUP BY page ORDER BY c DESC LIMIT 10;
```

You may also put this property in `hive-site.xml` to override the default value.

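For example, the corresponding entry in `hive-site.xml` would be a standard Hadoop property block (the value 10 below is only illustrative):

```
<property>
  <name>spark.sql.shuffle.partitions</name>
  <value>10</value>
</property>
```
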
#### Caching

The `shark.cache` table property no longer exists, and tables whose names end with `_cached` are no longer automatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to let users control table caching explicitly:

```
CACHE TABLE logs_last_month;
UNCACHE TABLE logs_last_month;
```

**NOTE** `CACHE TABLE tbl` is lazy: it only marks table `tbl` as "needs to be cached if necessary", but doesn't actually cache it until a query that touches `tbl` is executed. To force the table to be cached, you may simply count the table immediately after executing `CACHE TABLE`:

```
CACHE TABLE logs_last_month;
SELECT COUNT(1) FROM logs_last_month;
```

Several caching related features are not supported yet:

* User defined partition level cache eviction policy
* RDD reloading
* In-memory cache write through policy

### Compatibility with Apache Hive

#### Deploying in Existing Hive Warehouses

The Spark SQL Thrift JDBC server is designed to be "out of the box" compatible with existing Hive installations. You do not need to modify your existing Hive Metastore or change the data placement or partitioning of your tables.

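For example, a table that already exists in the Hive warehouse can be queried in place; the table name and partition column below are hypothetical:

```
-- An existing partitioned Hive table, queried without changing its layout or metadata.
SELECT COUNT(*) FROM logs WHERE ds = '2014-07-01';
```
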
#### Supported Hive Features

Spark SQL supports the vast majority of Hive features, such as:

* Hive query statements, including:
  * `SELECT`
  * `GROUP BY`
  * `ORDER BY`
  * `CLUSTER BY`
  * `SORT BY`
* All Hive operators, including:
  * Relational operators (`=`, `<=>`, `==`, `<>`, `<`, `>`, `>=`, `<=`, etc)
  * Arithmetic operators (`+`, `-`, `*`, `/`, `%`, etc)
  * Logical operators (`AND`, `&&`, `OR`, `||`, etc)
  * Complex type constructors
  * Mathematical functions (`sign`, `ln`, `cos`, etc)
  * String functions (`instr`, `length`, `printf`, etc)
* User defined functions (UDF)
* User defined aggregation functions (UDAF)
* User defined serialization formats (SerDe's)
* Joins
  * `JOIN`
  * `{LEFT|RIGHT|FULL} OUTER JOIN`
  * `LEFT SEMI JOIN`
  * `CROSS JOIN`
* Unions
* Sub-queries
  * `SELECT col FROM (SELECT a + b AS col FROM t1) t2`
* Sampling
* Explain
* Partitioned tables
* All Hive DDL functions, including:
  * `CREATE TABLE`
  * `CREATE TABLE AS SELECT`
  * `ALTER TABLE`
* Most Hive data types, including:
  * `TINYINT`
  * `SMALLINT`
  * `INT`
  * `BIGINT`
  * `BOOLEAN`
  * `FLOAT`
  * `DOUBLE`
  * `STRING`
  * `BINARY`
  * `TIMESTAMP`
  * `ARRAY<>`
  * `MAP<>`
  * `STRUCT<>`

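As a rough sketch of several of these features used together (the table and column names are hypothetical), the following combines `CREATE TABLE AS SELECT`, a join, `GROUP BY`, a sub-query, and `ORDER BY`:

```
-- Hypothetical tables: logs(page STRING, user_id INT) and users(id INT, country STRING).
CREATE TABLE page_counts AS
SELECT page, count(*) AS cnt
FROM logs l JOIN users u ON (l.user_id = u.id)
GROUP BY page;

SELECT page, cnt
FROM (SELECT page, cnt FROM page_counts) t
ORDER BY cnt DESC LIMIT 10;
```
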
#### Unsupported Hive Functionality

Below is a list of Hive features that we don't support yet. Most of these features are rarely used in Hive deployments.

**Major Hive Features**

* Tables with buckets: bucketing is hash partitioning within a Hive table partition. Spark SQL doesn't support buckets yet.

**Esoteric Hive Features**

* Tables with partitions using different input formats: In Spark SQL, all table partitions need to have the same input format.
* Non-equi outer join: For the uncommon use case of outer joins with non-equi join conditions (e.g. the condition `key < 10`), Spark SQL will output wrong results for the `NULL` tuple; see the sketch after this list.
* `UNIONTYPE`
* Unique join
* Single query multi insert
* Column statistics collecting: Spark SQL does not currently piggyback on scans to collect column statistics.

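To make the non-equi outer join caveat above concrete, a query of roughly the following shape (hypothetical tables) can produce incorrect `NULL`-padded rows:

```
-- Outer join whose condition is not an equality.
SELECT a.key, b.value
FROM src1 a LEFT OUTER JOIN src2 b ON (a.key < 10);
```
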
**Hive Input/Output Formats**

* File format for CLI: For results displayed back in the CLI, Spark SQL only supports `TextOutputFormat`.
* Hadoop archive

**Hive Optimizations**

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are not necessary due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.

* Block level bitmap indexes and virtual columns (used to build indexes)
* Automatically convert a join to a map join: For joining a large table with multiple small tables, Hive automatically converts the join into a map join. We are adding this auto conversion in the next release.
* Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you need to control the degree of parallelism post-shuffle using "`SET spark.sql.shuffle.partitions=[num_tasks];`". We are going to add auto-setting of parallelism in the next release.
* Meta-data only query: For queries that can be answered by using only metadata, Spark SQL still launches tasks to compute the result.
* Skew data flag: Spark SQL does not follow the skew data flags in Hive.
* `STREAMTABLE` hint in join: Spark SQL does not follow the `STREAMTABLE` hint.
* Merge multiple small files for query results: If the result output contains multiple small files, Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS metadata. Spark SQL does not support that.

## Running the Spark SQL CLI

The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute
