Commit d2103c6

fix: jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data dev install (#257)
* fix: jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data dev install
* docs: remove dead link
* update notebook
1 parent e706b7b commit d2103c6

File tree

3 files changed: +5 −5 lines changed

docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detection-taxi-data.adoc

Lines changed: 1 addition & 2 deletions
@@ -3,7 +3,6 @@
 :scikit-lib: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
 :k8s-cpu: https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu
 :spark-pkg: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html
-:forest-article: https://towardsdatascience.com/isolation-forest-and-spark-b88ade6c63ff
 :pyspark: https://spark.apache.org/docs/latest/api/python/getting_started/index.html
 :forest-algo: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
 :nyc-taxi: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
@@ -133,7 +132,7 @@ In practice, clients of Spark Connect do not need a full-blown Spark installatio
 == Model details
 
 The job uses an implementation of the Isolation Forest {forest-algo}[algorithm] provided by the scikit-learn {scikit-lib}[library]:
-the model is trained and then invoked by a user-defined function (see {forest-article}[this article] for how to call the sklearn library with a pyspark UDF), all of which is run using the Spark Connect executors.
+the model is trained and then invoked by a user-defined function running on the Spark Connect executors.
 This type of model attempts to isolate each data point by continually partitioning the data.
 Data closely packed together will require more partitions to separate data points.
 In contrast, any outliers will require less: the number of partitions needed for a particular data point is thus inversely proportional to the anomaly "score".

stacks/jupyterhub-pyspark-hdfs/notebook.ipynb

Lines changed: 2 additions & 2 deletions
@@ -27,14 +27,14 @@
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
 "spark = (\n",
 "    SparkSession\n",
 "    .builder\n",
-"    .remote(\"sc://spark-connect-server-default:15002\")\n",
+"    .remote(\"sc://spark-connect-server:15002\")\n",
 "    .appName(\"taxi-data-anomaly-detection\")\n",
 "    .getOrCreate()\n",
 ")"

stacks/jupyterhub-pyspark-hdfs/spark_connect.yaml

Lines changed: 2 additions & 1 deletion
@@ -53,8 +53,9 @@ spec:
   - name: hdfs-discovery-configmap
     configMap:
       name: hdfs
-  config:
+  roleConfig:
     listenerClass: external-unstable
+  config:
     resources:
       memory:
         limit: "2Gi"
