82 changes: 82 additions & 0 deletions doc/run_with_hive.md
@@ -0,0 +1,82 @@
# How SQLFlow connects with Hive

This document is a tutorial on how SQLFlow connects to Hive via [HiveServer2](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview).

## Connect to an existing Hive server

To connect to an existing Hive server instance, we only need to configure a `datasource` string in the following format:

``` text
hive://user:password@ip:port/dbname[?auth=<auth_mechanism>&session.<cfg_key1>=<cfg_value1>...&session.<cfg_keyN>=<cfg_valueN>]
```

In the above format,

- `user:password` is the username and password for HiveServer2.
- `ip:port` is the endpoint that the HiveServer2 instance listens on.
- `dbname` is the default database name.
- `auth_mechanism` is the authentication mechanism of HiveServer2: `NOSASL` for unsecured transport or `PLAIN` for SASL transport.
- Parameters with the prefix `session.` are session configurations of the Hive Thrift API; for example, `session.mapreduce_job_queuename=mr` implies `mapreduce.job.queuename=mr`.
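
As a minimal sketch of the format above, the following shell helpers assemble a `datasource` string from its parts and illustrate the `session.` key mapping. Both function names are hypothetical and are not part of SQLFlow or gohive. The underscore-to-dot rule is inferred from the single `mapreduce_job_queuename` example and may not hold for every key.

```bash
# Hypothetical helper: build a Hive datasource string.
# Usage: build_hive_datasource USER PASSWORD HOST PORT DBNAME [key=value ...]
build_hive_datasource() {
  local user="$1" password="$2" host="$3" port="$4" dbname="$5"
  shift 5
  local datasource="hive://${user}:${password}@${host}:${port}/${dbname}"
  local sep="?"
  local param
  # Remaining arguments are key=value pairs such as auth=PLAIN or
  # session.mapreduce_job_queuename=mr; the first is joined with '?',
  # the rest with '&'.
  for param in "$@"; do
    datasource="${datasource}${sep}${param}"
    sep="&"
  done
  printf '%s\n' "$datasource"
}

# Hypothetical helper: show which Hive setting a session.* key refers to.
# Assumption: strip the "session." prefix and replace underscores with dots.
session_key_to_hive_key() {
  printf '%s\n' "${1#session.}" | tr '_' '.'
}

build_hive_datasource sqlflow sqlflow localhost 10000 iris \
  auth=PLAIN session.mapreduce_job_queuename=mr
# prints hive://sqlflow:sqlflow@localhost:10000/iris?auth=PLAIN&session.mapreduce_job_queuename=mr

session_key_to_hive_key session.mapreduce_job_queuename
# prints mapreduce.job.queuename
```

The resulting string can be passed to `sqlflowserver --datasource=...`. Remember to quote it in the shell, because `&` and `?` are special characters.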

You can find more details at [gohive](https://sql-machine-learning.github.io/doc_index/gohive.html).

Using the `datasource` string, you can launch an all-in-one Docker container by running:

``` bash
docker run --rm -p 8888:8888 sqlflow/sqlflow bash -c \
"sqlflowserver --datasource='hive://root:root@localhost:10000/iris' &
SQLFLOW_SERVER=localhost:50051 jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root --NotebookApp.token=''"
```

Then you can open a web browser and go to `localhost:8888`. There are many SQLFlow tutorials, e.g. `tutorial_dnn_iris.ipynb`. You can follow the tutorials and substitute the data for your own use.

## Connect to a standalone Hive server for testing

We also provide a standalone Hive server Docker image for testing.

### Connect to the Hive server with NOSASL transport

Launch your standalone Hive server Docker container by running:

``` bash
docker run -d -p 8888:8888 --name=hive sqlflow/gohive:dev
```

This implies the following setting in `hive-site.xml`:

``` text
hive.server2.authentication=NOSASL
```

Test SQLFlow by running the tutorials in Jupyter Notebook:

``` bash
# No -p flag is needed here: --net=container:hive shares the network stack of
# the hive container, which already publishes port 8888 to the host.
docker run --rm --net=container:hive sqlflow/sqlflow \
bash -c "sqlflowserver --datasource='hive://root:root@localhost:10000/' &
SQLFLOW_SERVER=localhost:50051 jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root --NotebookApp.token=''"
```

### Connect to the Hive server with PLAIN SASL transport

This section uses [PAM](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2#SettingUpHiveServer2-PluggableAuthenticationModules(PAM)) authentication for the demonstration.

Launch your standalone Hive server Docker container with PAM authentication enabled:

``` bash
docker run -d -e WITH_HS2_PAM_AUTH=ON -p 8888:8888 --name=hive sqlflow/gohive:dev
```

This implies the following settings in `hive-site.xml`:

``` text
hive.server2.authentication=PAM
hive.server2.authentication.pam.services=login,sshd
```

Test SQLFlow by running the tutorials in Jupyter Notebook:

``` bash
docker run --rm --net=container:hive sqlflow/sqlflow \
bash -c "sqlflowserver --datasource='hive://sqlflow:sqlflow@localhost:10000/?auth=PLAIN' &
SQLFLOW_SERVER=localhost:50051 jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root --NotebookApp.token=''"
```