docs/sql-programming-guide.md: 129 additions & 1 deletion
@@ -136,7 +136,7 @@ val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)
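The note above points at the `Product` workaround for the 22-field limit of Scala 2.10 case classes. A minimal sketch of that idea (only two fields shown for brevity, and the class and field names are purely illustrative):

```scala
// Sketch: an ordinary class implementing Product can stand in for a case class
// when a record needs more than 22 fields (only two fields shown here).
class Record(val name: String, val age: Int) extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
  def productArity: Int = 2
  def productElement(n: Int): Any = n match {
    case 0 => name
    case 1 => age
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}
```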
@@ -601,6 +601,134 @@ Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.
You may also use the beeline script that comes with Hive.
### Migration Guide for Shark Users

#### Reducer number

In Shark, the default reducer number is 1 and can be tuned by the `mapred.reduce.tasks` property. In Spark SQL, the reducer number defaults to 200 and can be customized by the `spark.sql.shuffle.partitions` property:

```
SET spark.sql.shuffle.partitions=10;
SELECT page, count(*) c FROM logs_last_month_cached
GROUP BY page ORDER BY c DESC LIMIT 10;
```

You may also put this property in `hive-site.xml` to override the default value.
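The same setting can also be applied from a Spark application. Below is a minimal sketch using the Scala API, assuming a `HiveContext` and the `logs_last_month_cached` table from the example above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("ShufflePartitionsExample"))
val hiveContext = new HiveContext(sc)

// Lower the post-shuffle parallelism before running the aggregation.
hiveContext.hql("SET spark.sql.shuffle.partitions=10")
val topPages = hiveContext.hql(
  "SELECT page, count(*) c FROM logs_last_month_cached GROUP BY page ORDER BY c DESC LIMIT 10")
topPages.collect().foreach(println)
```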
#### Caching

The `shark.cache` table property no longer exists, and tables whose names end with `_cached` are no longer automatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to let users control table caching explicitly:

```
CACHE TABLE logs_last_month;
UNCACHE TABLE logs_last_month;
```
**NOTE:** `CACHE TABLE tbl` is lazy: it only marks table `tbl` as "needs to be cached if necessary", but doesn't actually cache it until a query that touches `tbl` is executed. To force the table to be cached, you may simply count the table immediately after executing `CACHE TABLE`:

```
CACHE TABLE logs_last_month;
SELECT COUNT(1) FROM logs_last_month;
```
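The same caching control is available from the Scala API. A small sketch, assuming an existing `hiveContext` and the `logs_last_month` table used above:

```scala
// Mark the table as cached, then force materialization with an action.
hiveContext.cacheTable("logs_last_month")
hiveContext.hql("SELECT COUNT(1) FROM logs_last_month").collect()

// Later, drop the table from the in-memory cache.
hiveContext.uncacheTable("logs_last_month")
```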
Several caching-related features are not supported yet:

* User-defined partition-level cache eviction policy
* RDD reloading
* In-memory cache write-through policy
### Compatibility with Apache Hive

#### Deploying in Existing Hive Warehouses

The Spark SQL Thrift JDBC server is designed to be "out of the box" compatible with existing Hive
installations. You do not need to modify your existing Hive Metastore or change the data placement
or partitioning of your tables.
#### Supported Hive Features

Spark SQL supports the vast majority of Hive features, such as:

* `SELECT col FROM ( SELECT a + b AS col from t1) t2`
* Sampling
* Explain
* Partitioned tables
* All Hive DDL Functions, including:
  * `CREATE TABLE`
  * `CREATE TABLE AS SELECT`
  * `ALTER TABLE`
* Most Hive Data types, including:
  * `TINYINT`
  * `SMALLINT`
  * `INT`
  * `BIGINT`
  * `BOOLEAN`
  * `FLOAT`
  * `DOUBLE`
  * `STRING`
  * `BINARY`
  * `TIMESTAMP`
  * `ARRAY<>`
  * `MAP<>`
  * `STRUCT<>`
#### Unsupported Hive Functionality

Below is a list of Hive features that we don't support yet. Most of these features are rarely used in Hive deployments.

**Major Hive Features**

* Tables with buckets: a bucket is the hash partitioning within a Hive table partition. Spark SQL doesn't support buckets yet.

**Esoteric Hive Features**

* Tables with partitions using different input formats: In Spark SQL, all table partitions need to have the same input format.
* Non-equi outer join: For the uncommon use case of using outer joins with non-equi join conditions (e.g. the condition `key < 10`), Spark SQL will output wrong results for the `NULL` tuple.
* `UNIONTYPE`
* Unique join
* Single query multi insert
* Column statistics collecting: Spark SQL does not piggyback scans to collect column statistics at the moment.

**Hive Input/Output Formats**

* File format for CLI: For results shown back to the CLI, Spark SQL only supports `TextOutputFormat`.
* Hadoop archive

**Hive Optimizations**

A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are not necessary due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL.
* Block level bitmap indexes and virtual columns (used to build indexes)
* Automatically convert a join to map join: For joining a large table with multiple small tables, Hive automatically converts the join into a map join. We are adding this auto conversion in the next release.
* Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you need to control the degree of parallelism post-shuffle using `SET spark.sql.shuffle.partitions=[num_tasks];`. We are going to add auto-setting of parallelism in the next release.
* Meta-data only query: For queries that can be answered by using only metadata, Spark SQL still launches tasks to compute the result.
* Skew data flag: Spark SQL does not follow the skew data flags in Hive.
* `STREAMTABLE` hint in join: Spark SQL does not follow the `STREAMTABLE` hint.
* Merge multiple small files for query results: If the result output contains multiple small files, Hive can optionally merge the small files into fewer large files to avoid overflowing the HDFS metadata. Spark SQL does not support that.
## Running the Spark SQL CLI
The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute