Commit be97fcc

[New] Introduction to PySpark (#1462)
* Rebase with copy edit
* Spelling Binded -> bound
* Minor wording fixes
* Temporarily removed shortguides due to reload issue
1 parent c4cfba0 commit be97fcc

3 files changed

+245
-1
lines changed

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
1+
---
2+
author:
3+
name: Jared Kobos
4+
5+
description: 'Shortguide for installing Java 8 JDK with the Oracle ppa repositories.'
6+
license: '[CC BY-ND 4.0](https://creativecommons.org/licenses/by-nd/4.0)'
7+
keywords: []
8+
modified: 2018-02-02
9+
modified_by:
10+
name: Sam Foo
11+
title: "How to Install Java 8 JDK"
12+
published: 2018-01-09
13+
shortguide: true
14+
show_on_rss_feed: false
15+
---
16+
17+
1. Install `software-properties-common` to easily add new repositories:
18+
19+
sudo apt-get install software-properties-common
20+
21+
2. Add the Java PPA in order to download from Oracle repositories:
22+
23+
sudo add-apt-repository ppa:webupd8team/java
24+
25+
3. Update the source list:
26+
27+
sudo apt-get update
28+
29+
4. Install the Java JDK 8:
30+
31+
sudo apt-get install oracle-java8-installer

docs/development/python/install_python_miniconda.md

Lines changed: 3 additions & 1 deletion
@@ -22,7 +22,9 @@ show_on_rss_feed: false
2222

2323
2. You will be prompted several times during the installation process. Review the terms and conditions and select "yes" for each prompt.
2424

25+
3. Restart your shell session for the changes to your PATH to take effect.
2526

26-
3. Check your Python version:
27+
28+
4. Check your Python version:
2729

2830
python --version
Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
1+
---
2+
author:
3+
name: Sam Foo
4+
5+
description: 'Learn how to install and use PySpark on your Linode for distributed computing. In this guide, we will use an example of counting words in a corpus to learn the PySpark API.'
6+
og_description: 'Learn how to install and use PySpark on your Linode for distributed computing. In this guide, we will use an example of counting words in a corpus to learn the PySpark API.'
7+
keywords: ["big data", "spark", "nltk", "mapreduce", "pyspark", "hadoop"]
8+
license: '[CC BY-ND 4.0](https://creativecommons.org/licenses/by-nd/4.0)'
9+
modified: 2018-02-05
10+
modified_by:
11+
name: Sam Foo
12+
title: "Introduction to PySpark"
13+
published: 2018-02-05
14+
external_resources:
15+
- '[AMPLab Paper on RDDs](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf)'
16+
- '[Spark Documentation](https://spark.apache.org/)'
17+
- '[PySpark Documentation](https://spark.apache.org/docs/latest/api/python/#)'
18+
---
19+
20+
## What is PySpark?
21+
22+
[Apache Spark](https://spark.apache.org/) is a big-data processing engine with several advantages over MapReduce. Spark offers greater simplicity by removing much of the boilerplate code seen in Hadoop. In addition, since Spark handles most operations in memory, it is often faster than MapReduce, where data is written to disk after each operation.
23+
24+
PySpark is a Python API for Spark. This guide shows how to install PySpark on a single Linode. It then introduces PySpark's API by analyzing a set of text files to find the five most frequent words across all Presidential inaugural addresses.
25+
26+
## Install Prerequisites
27+
28+
Installing PySpark requires Scala, which has Java JDK 8 as a dependency. Miniconda will be used to install PySpark and NLTK, and NLTK will be used to download the data.
29+
30+
### Miniconda
31+
32+
1. Download and install Miniconda:
33+
34+
curl -OL https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
35+
bash Miniconda3-latest-Linux-x86_64.sh
36+
37+
2. You will be prompted several times during the installation process. Review the terms and conditions and select "yes" for each prompt.
38+
39+
3. Restart your shell session for the changes to your PATH to take effect.
40+
41+
42+
4. Check your Python version:
43+
44+
python --version
45+
46+
### Java JDK 8
47+
48+
1. Install `software-properties-common` to easily add new repositories:
49+
50+
sudo apt-get install software-properties-common
51+
52+
2. Add the Java PPA in order to download from Oracle repositories:
53+
54+
sudo add-apt-repository ppa:webupd8team/java
55+
56+
3. Update the source list:
57+
58+
sudo apt-get update
59+
60+
4. Install the Java JDK 8:
61+
62+
sudo apt-get install oracle-java8-installer
63+
64+
### Scala
65+
66+
Scala has access to several Spark API calls that are not supported in Python. Although Scala offers better performance than Python, Python is much easier to write and has a greater range of libraries. Depending on the use case, Scala might be preferable to PySpark.
67+
68+
1. Download the Debian package and install it:
69+
70+
wget https://downloads.lightbend.com/scala/2.12.4/scala-2.12.4.deb
71+
sudo dpkg -i scala-2.12.4.deb
72+
73+
## Install PySpark
74+
75+
1. Using Miniconda, create a new virtual environment:
76+
77+
conda create -n linode_pyspark python=3
78+
source activate linode_pyspark
79+
80+
2. Install PySpark and the [Natural Language Toolkit (NLTK)](http://www.nltk.org/):
81+
82+
conda install -c conda-forge pyspark nltk
83+
84+
3. Start PySpark. Expect a few warnings because the configuration is not set up for a cluster:
85+
86+
pyspark
87+
88+
{{< output >}}
89+
Python 3.6.3 |Anaconda, Inc.| (default, Nov 20 2017, 20:41:42)
90+
[GCC 7.2.0] on linux
91+
Type "help", "copyright", "credits" or "license" for more information.
92+
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
93+
Setting default log level to "WARN".
94+
...
95+
Welcome to
96+
____ __
97+
/ __/__ ___ _____/ /__
98+
_\ \/ _ \/ _ `/ __/ '_/
99+
/__ / .__/\_,_/_/ /_/\_\ version 2.2.1
100+
/_/
101+
102+
Using Python version 3.6.3 (default, Nov 20 2017 20:41:42)
103+
SparkSession available as 'spark'.
104+
>>>
105+
{{< /output >}}
106+
107+
## Download Sample Data
108+
109+
The data used in this guide is a compilation of text files of every Presidential inaugural address from 1789 to 2009. This dataset is available from NLTK. Miniconda and the NLTK package have built-in functionality to simplify downloading from the command line.
110+
111+
1. Import NLTK and download the text files. In addition to the corpus, download a list of stop words.
112+
113+
import nltk
114+
nltk.download('inaugural')
115+
nltk.download('stopwords')
116+
117+
2. Import the file objects and show a list of available text files downloaded from the NLTK package.
118+
119+
from nltk.corpus import inaugural, stopwords
120+
inaugural.fileids()
121+
122+
This should return a list of text files, one per inaugural address, from George Washington to Barack Obama.
123+
124+
{{< note >}}
125+
The files are located in `/home/linode/nltk_data/corpora/inaugural/` where `linode` is the username.
126+
{{< /note >}}
127+
128+
Although it is possible to accomplish most objectives of this guide purely with Python, the aim is to demonstrate the PySpark API, which will also work with data distributed across a cluster.
129+
130+
## PySpark API
131+
132+
Spark utilizes the concept of a Resilient Distributed Dataset (RDD). The RDD is characterized by:
133+
134+
- Immutability - Changes to the data return a new RDD rather than modifying an existing one
135+
- Distributed - Data can exist on a cluster and be operated on in parallel
136+
- Partitioned - More partitions allow work to be distributed among the cluster but too many partitions create unnecessary overhead in scheduling
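
The three properties above can be illustrated without Spark. The following is a plain-Python analogy only, not PySpark code: the `partition` helper, the sample data, and the partition count are hypothetical choices made for illustration. It splits an immutable dataset into chunks, processes each chunk independently, and combines the partial results.

```python
from functools import reduce

def partition(items, num_partitions):
    """Split a sequence into roughly equal chunks (a loose stand-in for RDD partitions)."""
    size, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(tuple(items[start:end]))  # tuples cannot be modified in place
        start = end
    return chunks

data = tuple(range(1, 11))       # the original dataset is never modified
partitions = partition(data, 3)  # three independent chunks

# Each chunk can be processed on its own; in Spark this work may run on different nodes.
partial_sums = [sum(chunk) for chunk in partitions]

# The partial results are combined into a single answer.
total = reduce(lambda a, b: a + b, partial_sums)
```

In a real cluster the number of partitions is a tuning decision: too few limits parallelism, while too many adds scheduling overhead, as noted in the list above.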
137+
138+
This portion of the guide will focus on how to load data into PySpark as an RDD. Then, some of the PySpark API is demonstrated through simple operations like counting. Finally, more complex operations like filtering and aggregation will be used to count the most frequent words in inaugural addresses.
139+
140+
### Read Data into PySpark
141+
142+
Since PySpark is run from the shell, a SparkContext is already bound to the variable `sc`. Standalone programs running outside of the shell need to import SparkContext and create an instance themselves. The SparkContext object represents the entry point for Spark's functionality.
143+
144+
1. Read from the collection of text files from NLTK, taking care to specify the absolute path of the text files. Assuming the corpus was downloaded through the method described above, replace `linode` with your Unix username:
145+
146+
text_files = sc.textFile("file:///home/linode/nltk_data/corpora/inaugural/*.txt")
147+
148+
2. There are two types of operations in Spark: __transformations__ and __actions__. Transformations are lazily evaluated operations that return a new RDD. Spark does not compute a transformation until an __action__ needs to return a result. An example of an action is the `count()` method, which counts the total number of lines in all the files:
149+
150+
>>> text_files.count()
151+
2873
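
Lazy evaluation can be hard to picture, so here is a rough analogy in plain Python rather than PySpark. Python's built-in `map` is also lazy, which makes it a convenient stand-in for a transformation, with `list()` playing the role of an action. The `to_upper` function, the `calls` list, and the sample lines are hypothetical and exist only to show when the work happens.

```python
calls = []

def to_upper(line):
    calls.append(line)  # record each time the function actually runs
    return line.upper()

lines = ["we the people", "of the united states"]

# Comparable to a transformation: this only describes the work, nothing runs yet.
lazy_result = map(to_upper, lines)
calls_before_action = len(calls)

# Comparable to an action: consuming the result forces the computation.
materialized = list(lazy_result)
calls_after_action = len(calls)
```

Here `calls_before_action` is 0 and `calls_after_action` is 2, showing that the function is applied only once a result is requested. Spark's scheduler is far more involved than a Python iterator, so treat this as intuition rather than a description of Spark internals.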
152+
153+
154+
### Clean and Tokenize Data
155+
156+
1. To count words, the sentences must be tokenized. Before this can be done, remove all punctuation and convert all of the words to lowercase to simplify counting:
157+
158+
import string
159+
removed_punct = text_files.map(lambda sent: sent.translate({ord(c): None for c in string.punctuation}).lower())
160+
161+
Since `map` is a transformation, the function is not applied until an action takes place.
162+
163+
{{< note >}}
164+
If a step is unclear, try calling `.collect()` on the RDD to see the intermediate output.
165+
{{< /note >}}
166+
167+
2. Tokenize the sentences:
168+
169+
tokenize = removed_punct.flatMap(lambda sent: sent.split(" "))
170+
171+
{{< note >}}
172+
Similar to Python's `map` function, PySpark's `map` returns an RDD with the same number of elements as its input (2873, in this example). `flatMap` can return an RDD of a different size, which is needed when splitting each line into words.
173+
{{< /note >}}
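
The difference between `map` and `flatMap` can be sketched in plain Python, without Spark, using the same cleaning and splitting lambdas as the steps above. The two sample sentences are made up for illustration, and `itertools.chain.from_iterable` is used here only as a stand-in for the flattening that `flatMap` performs.

```python
import string
from itertools import chain

sentences = ["Fellow-Citizens of the Senate,", "and of the House!"]

# Same cleaning logic as the lambda passed to map() above.
strip_punct = {ord(c): None for c in string.punctuation}
cleaned = [s.translate(strip_punct).lower() for s in sentences]

# Like map: one output element (a list of words) per input element.
mapped = [s.split(" ") for s in cleaned]

# Like flatMap: the per-line lists are flattened into a single sequence of words.
flat = list(chain.from_iterable(mapped))
```

`mapped` still has two elements, one per sentence, while `flat` has eight, one per word. Note that deleting punctuation joins hyphenated words, so "Fellow-Citizens" becomes the single token "fellowcitizens".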
174+
175+
176+
### Filter and Aggregate Data
177+
178+
1. Through method chaining, multiple transformations can be applied without creating a new reference to an RDD at each step. `map` pairs every word with the value 1, and `reduceByKey` is the transformation that counts each word by summing the values of all pairs that share the same word:
179+
180+
result = tokenize.map(lambda word: (word, 1))\
181+
.reduceByKey(lambda a, b: a + b)
182+
183+
2. Stopwords (such as "a", "an", "the", etc.) should be removed because those words are used frequently in the English language but provide no value in this context. While filtering, clean the data by removing empty strings. The results are then sorted with `takeOrdered`, which returns the five most frequent words:
184+
185+
words = stopwords.words('english')
186+
187+
result.filter(lambda word: word[0] not in words and word[0] != '')\
188+
.takeOrdered(5, key = lambda x: -x[1])
189+
190+
{{< output >}}
191+
[('government', 557), ('people', 553), ('us', 455), ('upon', 369), ('must', 346)]
192+
{{< /output >}}
193+
194+
Among the top five words, "government" is the most frequent with a count of 557, followed closely by "people" at 553.
195+
196+
3. The transformations and the action can be chained into a single expression. Remember to replace `linode` with your Unix username:
197+
198+
import string
199+
from nltk.corpus import stopwords
200+
201+
words = stopwords.words('english')
202+
203+
sc.textFile("file:///home/linode/nltk_data/corpora/inaugural/*.txt")\
204+
.map(lambda sent: sent.translate({ord(c): None for c in string.punctuation}).lower())\
205+
.flatMap(lambda sent: sent.split(" "))\
206+
.map(lambda word: (word, 1))\
207+
.reduceByKey(lambda a, b: a + b)\
208+
.filter(lambda word: word[0] not in words and word[0] != '')\
209+
.takeOrdered(5, key = lambda x: -x[1])
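
To see what each stage of the chain contributes, the same logic can be traced in plain Python on a tiny input. This is a sketch for intuition only and does not use Spark: the sample lines and the four-word `stop_words` set are hypothetical stand-ins for the corpus and for NLTK's stopword list, and `collections.Counter` and `heapq.nsmallest` stand in for `reduceByKey` and `takeOrdered`.

```python
import heapq
import string
from collections import Counter

lines = [
    "The people and the government.",
    "Government of the people, by the people!",
    "",
]
stop_words = {"the", "and", "of", "by"}  # hypothetical stand-in for stopwords.words('english')

strip_punct = {ord(c): None for c in string.punctuation}

# map + flatMap: clean each line, then split it into words.
tokens = [w for line in lines for w in line.translate(strip_punct).lower().split(" ")]

# map to (word, 1) + reduceByKey: add up the 1s for each distinct word.
counts = Counter()
for word in tokens:
    counts[word] += 1

# filter: drop stopwords and the empty strings produced by blank lines.
filtered = [(w, n) for w, n in counts.items() if w not in stop_words and w != ""]

# takeOrdered(2, key=lambda x: -x[1]): the two pairs with the highest counts.
top = heapq.nsmallest(2, filtered, key=lambda x: -x[1])
```

The empty line produces an empty-string token, which is why the filter step checks for `''` in addition to stopwords.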
210+
211+
PySpark has many additional capabilities, including DataFrames, SQL, streaming, and even a machine learning module. Refer to the [PySpark documentation](https://spark.apache.org/docs/latest/api/python/) for a comprehensive list.
