Commit ebf9c44

2 parents 8901fd9 + b881e5e commit ebf9c44

302 files changed

+3078
-1747
lines changed

@@ -0,0 +1,111 @@
# End-to-end machine learning workload: Census
This code sample shows how to use Modin for ETL operations and the ridge regression algorithm from the DAAL-accelerated scikit-learn library to build and run an end-to-end machine learning workload. It demonstrates software products that are available in the [Intel AI Analytics Toolkit powered by oneAPI](https://software.intel.com/content/www/us/en/develop/tools/oneapi/ai-analytics-toolkit.html).

| Optimized for | Description
| :--- | :---
| OS | 64-bit Linux: Ubuntu 18.04 or higher
| Hardware | Intel Atom® Processors; Intel® Core™ Processor Family; Intel® Xeon® Processor Family; Intel® Xeon® Scalable Performance Processor Family
| Software | Python version 3.7, Modin, Ray, daal4py, scikit-learn, NumPy, Intel® AI Analytics Toolkit
| What you will learn | How to use Modin and DAAL-optimized scikit-learn (developed and owned by Intel) to build end-to-end ML workloads and improve performance.
| Time to complete | 15-18 minutes

## Purpose
Modin uses Ray to speed up your pandas notebooks, scripts, and libraries with minimal effort. Unlike other distributed DataFrame libraries, Modin integrates with and stays compatible with existing pandas code. daal4py is a simplified Python API to Intel® DAAL aimed at data scientists and machine learning users. It provides an abstraction over Intel® DAAL that can be used directly or integrated into your own framework.
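
As a rough illustration of the drop-in compatibility described above, the sketch below selects a DataFrame module at import time. The helper name `load_dataframe_module` is hypothetical and not part of this sample; `modin.pandas` is Modin's pandas-compatible namespace, and the fallback to plain pandas is an assumption made here so the snippet also runs where Modin is not installed.

```python
import importlib
import importlib.util


def load_dataframe_module():
    """Return modin.pandas if Modin is installed, else pandas, else None.

    Hypothetical helper for illustration only.
    """
    for package, module in (("modin", "modin.pandas"), ("pandas", "pandas")):
        # find_spec on a top-level name checks availability without importing it.
        if importlib.util.find_spec(package) is not None:
            return importlib.import_module(module)
    return None


pd = load_dataframe_module()
if pd is not None:
    # The same pandas-style calls are used with either backend.
    df = pd.DataFrame({"education": [10, 12, 16], "income": [20000, 30000, 52000]})
    mean_income = float(df["income"].mean())
```

The rest of a script written this way does not need to know which backend was chosen, which is the property the paragraph above refers to.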

#### Model and dataset
In this sample, you will use Modin to ingest and process U.S. census data from 1970 to 2010 and build a ridge regression model of the relationship between education and total income earned in the U.S.
The data transformation stage adjusts income for yearly inflation, balances the data so that each year has a similar number of data points, and extracts features from the transformed dataset. The feature vectors are then fed into the ridge regression model to predict the income of each sample.
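
The three stages described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: the price-index values are placeholders, the function names and the `year` key are hypothetical, and the single-feature closed-form fit stands in for the DAAL-accelerated ridge regression that the sample actually uses.

```python
from collections import defaultdict
import random

# Placeholder price-index values (assumption); not the factors used by the sample.
PRICE_INDEX = {1970: 40.0, 1990: 130.0, 2010: 220.0}


def adjust_for_inflation(income, year, base_year=2010, index=PRICE_INDEX):
    """Express an income earned in `year` in `base_year` currency."""
    return income * index[base_year] / index[year]


def balance_by_year(rows, seed=0):
    """Downsample so every year keeps as many rows as the smallest year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row)
    target = min(len(group) for group in by_year.values())
    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    balanced = []
    for year in sorted(by_year):
        balanced.extend(rng.sample(by_year[year], target))
    return balanced


def fit_ridge_1d(xs, ys, alpha=1.0):
    """Closed-form ridge fit of y = w * x + b for a single feature.

    Centering x and y first leaves the intercept unpenalized.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / (sxx + alpha)  # alpha shrinks the slope toward zero
    b = y_mean - w * x_mean
    return w, b
```

With `alpha=0` the fit reduces to ordinary least squares; larger values of `alpha` shrink the slope, which is the regularization that distinguishes ridge regression.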

The dataset is from IPUMS USA, University of Minnesota, [www.ipums.org](https://ipums.org/) (Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0)

## Key Implementation Details
This end-to-end workload sample is implemented for CPU in Python. It requires Modin, Ray, daal4py, scikit-learn, and NumPy to be installed in a conda environment, as described in the [Intel AI Analytics Toolkit powered by oneAPI](https://software.intel.com/content/www/us/en/develop/articles/installing-ai-kit-with-conda.html) installation article and in the steps that follow in this README.

## License

This code sample is licensed under the MIT license.

## Building Modin and daal4py for CPU to build and run the end-to-end workload

Modin and the oneAPI Data Analytics Library (DAAL) are ready for use once you finish the Intel AI Analytics Toolkit installation with the Conda Package Manager.

Refer to the oneAPI [main page](https://software.intel.com/en-us/oneapi) and the toolkit's [Getting Started Guide for Linux](https://software.intel.com/content/www/us/en/develop/documentation/get-started-with-ai-linux/top.html) for installation steps and scripts.


### Activate conda environment With Root Access

Follow the Getting Started Guide steps (above) to set up your oneAPI environment with the `setvars.sh` script, and follow the [Intel Distribution of Modin environment installation](https://software.intel.com/content/www/us/en/develop/articles/installing-ai-kit-with-conda.html) instructions. Then, in a Linux shell, navigate to your oneAPI installation path, typically `/opt/intel/oneapi/` when installed as root or with sudo, and `~/intel/oneapi/` when installed without superuser privileges. If you customized the installation folder, the `setvars.sh` file is in your custom folder.
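
If you are unsure which of these locations applies on your machine, a small check like the following can help. It is a sketch only: the function name `find_setvars` is hypothetical, and it looks solely in the two typical roots named above unless you pass your own list.

```python
from pathlib import Path

# Typical install roots named in this README; pass a custom folder to override.
DEFAULT_ROOTS = ("/opt/intel/oneapi", "~/intel/oneapi")


def find_setvars(roots=DEFAULT_ROOTS):
    """Return the first existing setvars.sh under the given roots, or None."""
    for root in roots:
        candidate = Path(root).expanduser() / "setvars.sh"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    path = find_setvars()
    print(path if path else "setvars.sh not found in the default locations")
```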

Activate the conda environment with the following command:

#### Linux
```
source activate intel-aikit-modin
```

### Activate conda environment Without Root Access (Optional)

By default, the Intel AI Analytics Toolkit is installed in the `oneapi` folder, which requires root privileges to manage. To manage your conda environment without root access, create your own copy of the desired conda environment with the following command:

#### Linux
```
conda create --name intel-aikit-modin -c intel/label/oneapibeta -c intel -c conda-forge runipy intel-aikit-modin=2021.1b10
```

Then activate your conda environment with the following command:

```
conda activate intel-aikit-modin
```
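
To confirm from Python that the expected environment is active, for example at the top of a notebook, a check along these lines can be used. It relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation; the helper names are hypothetical.

```python
import os

EXPECTED_ENV = "intel-aikit-modin"  # environment name used in this README


def active_conda_env(environ=os.environ):
    """Return the active conda environment name, or None if none is active."""
    # conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return environ.get("CONDA_DEFAULT_ENV") or None


def in_expected_env(environ=os.environ, expected=EXPECTED_ENV):
    """True when the active conda environment matches `expected`."""
    return active_conda_env(environ) == expected


if __name__ == "__main__":
    print("active environment:", active_conda_env())
```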


### Install Jupyter Notebook

Install Jupyter Notebook in the active conda environment with conda or pip:

```
conda install jupyter nb_conda_kernels
```
or
```
pip install jupyter
```

### Install wget package

Install the wget package, which is used to retrieve the Census dataset over HTTPS:

```
pip install wget
```
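
For orientation, the download step amounts to fetching a file once and reusing it on later runs. The sketch below shows that pattern with the standard library's `urllib.request` rather than the wget package, so it has no extra dependency; the function name is hypothetical, and the URL and file name in the usage comment are placeholders, not the dataset's actual location.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest):
    """Download `url` to `dest` unless the file already exists; return the path."""
    dest = Path(dest)
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(dest))
    return dest


# Usage (placeholder URL and file name; substitute the ones from the notebook):
# download_if_missing("https://example.com/census.csv.gz", "data/census.csv.gz")
```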

#### View in Jupyter Notebook


Launch Jupyter Notebook in the directory that contains the code example:

```
jupyter notebook
```

## Running the end-to-end code sample

### Run as Jupyter Notebook

Open the .ipynb file and run cells in Jupyter Notebook using the "Run" button. Alternatively, run the entire notebook using the "Restart kernel and re-run whole notebook" button (see the image below, which uses the "census modin" sample).

![Click the Run Button in the Jupyter Notebook](Running_Jupyter_notebook.jpg "Run Button on Jupyter Notebook")

### Run as Python File

Open the notebook in Jupyter and download it as a Python file (see the image below, which uses the "census modin" sample).

![Download as python file in the Jupyter Notebook](Running_Jupyter_notebook_as_Python.jpg "Download as python file in the Jupyter Notebook")

Run the program:

`python census_modin.py`

##### Expected Printed Output:
Expected cell output shown for census_modin.ipynb:
![Output](Expected_output.jpg "Expected output for Jupyter Notebook")