Commit fa602b8

Merge pull request #155 from JohT/feature/migrate-from-sklearn-to-open-tsne
Migrate from sklearn.manifold TSNE to openTSNE for visualizing node embeddings
2 parents e3253c2 + 74364b5 commit fa602b8

5 files changed, +27 -34 lines changed

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -91,4 +91,7 @@ coverage/
 
 # Jupyter Notebook
 .ipynb_checkpoints
-*.nbconvert*
+*.nbconvert*
+
+# Python environments
+.conda

README.md

Lines changed: 1 addition & 1 deletion
@@ -120,7 +120,7 @@ The [Code Structure Analysis Pipeline](./.github/workflows/java-code-analysis.ym
 - [pip](https://pip.pypa.io/en/stable)
 - [monotonic](https://github.com/atdt/monotonic)
 - [Neo4j Python Driver](https://neo4j.com/docs/api/python-driver)
-- [sklearn](https://scikit-learn.org)
+- [openTSNE](https://github.com/pavlin-policar/openTSNE)
 - [wordcloud](https://github.com/amueller/word_cloud)
 - [Graph Visualization](./graph-visualization/README.md) uses [node.js](https://nodejs.org/de) and the dependencies listed in [package.json](./graph-visualization/package.json).
 

jupyter/NodeEmbeddingsJava.ipynb

Lines changed: 11 additions & 16 deletions
@@ -58,7 +58,7 @@
 "import matplotlib.pyplot as plot\n",
 "import typing as typ\n",
 "import numpy as np\n",
-"from sklearn.manifold import TSNE\n",
+"from openTSNE.sklearn import TSNE\n",
 "from neo4j import GraphDatabase"
 ]
 },
@@ -69,9 +69,9 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"import sklearn\n",
-"print('The scikit-learn version is {}.'.format(sklearn.__version__))\n",
-"print('The pandas version is {}.'.format(pd.__version__))\n"
+"from openTSNE import __version__ as openTSNE_version\n",
+"print('The openTSNE version is: {}'.format(openTSNE_version))\n",
+"print('The pandas version is: {}'.format(pd.__version__))\n"
 ]
 },
 {
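
Note on the version cell above: it imports `__version__` from the package, so the cell fails if openTSNE is missing. A minimal sketch of an alternative that reads the installed distribution metadata instead is shown below. The helper name `describe_version` is hypothetical and not part of the commit; it only assumes the standard library module `importlib.metadata` (Python 3.8 or later).

```python
from importlib import metadata


def describe_version(distribution_name: str) -> str:
    """Return a version line, or a note if the distribution is not installed.

    Hypothetical helper for illustration; the notebook itself prints
    openTSNE.__version__ and pd.__version__ directly.
    """
    try:
        installed_version = metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return "The {} distribution is not installed".format(distribution_name)
    return "The {} version is: {}".format(distribution_name, installed_version)


if __name__ == "__main__":
    # The output depends on the local environment, so none is shown here.
    for name in ("openTSNE", "pandas"):
        print(describe_version(name))
```

This keeps the message format of the updated cell and degrades to a readable note instead of an `ImportError`.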
@@ -231,7 +231,7 @@
 "\n",
 "> It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.\n",
 "\n",
-"(see https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE)"
+"(see https://opentsne.readthedocs.io)"
 ]
 },
 {
@@ -245,7 +245,7 @@
 " \"\"\"\n",
 " Reduces the dimensionality of the node embeddings (e.g. 64 floating point numbers in an array)\n",
 " to two dimensions for 2D visualization.\n",
-" see https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE\n",
+" see https://opentsne.readthedocs.io\n",
 " \"\"\"\n",
 "\n",
 " if embeddings.empty: \n",
@@ -258,16 +258,9 @@
 " # See https://bobbyhadz.com/blog/python-attributeerror-list-object-has-no-attribute-shape\n",
 " embeddings_as_numpy_array = np.array(embeddings.embedding.to_list())\n",
 "\n",
-" # The parameter \"perplexity\" needs to be smaller than the sample size\n",
-" # See https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html\n",
-" number_of_nodes=embeddings.shape[0]\n",
-" perplexity = min(number_of_nodes - 1.0, 30.0)\n",
-" print(\"t-SNE: Sample size (Number of nodes)={size}\".format(size = number_of_nodes))\n",
-" print(\"t-SNE: perplexity={perplexity}\".format(perplexity=perplexity))\n",
-"\n",
 " # Use t-distributed stochastic neighbor embedding (t-SNE) to reduce the dimensionality \n",
 " # of the previously calculated node embeddings to 2 dimensions for visualization\n",
-" t_distributed_stochastic_neighbor_embedding = TSNE(n_components=2, perplexity=perplexity, verbose=1, random_state=50)\n",
+" t_distributed_stochastic_neighbor_embedding = TSNE(n_components=2, verbose=1, random_state=47)\n",
 " two_dimension_node_embeddings = t_distributed_stochastic_neighbor_embedding.fit_transform(embeddings_as_numpy_array)\n",
 " display(two_dimension_node_embeddings.shape) # Display the shape of the t-SNE result\n",
 "\n",
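
Note on the hunk above: it drops the explicit `perplexity` argument together with the lines that kept it below the sample size. For reference, the removed logic can be sketched as a standalone helper. The function name `clamp_perplexity` and the guard for fewer than two samples are assumptions added for illustration; the commit does not state whether openTSNE still needs such a clamp for small graphs.

```python
def clamp_perplexity(number_of_nodes: int, default_perplexity: float = 30.0) -> float:
    """Return a perplexity that stays below the sample size.

    Mirrors the removed notebook line
    `perplexity = min(number_of_nodes - 1.0, 30.0)`.
    The ValueError guard is an assumption, not part of the original code.
    """
    if number_of_nodes < 2:
        raise ValueError("t-SNE needs at least two samples")
    return min(number_of_nodes - 1.0, default_perplexity)


if __name__ == "__main__":
    for sample_size in (5, 31, 1000):
        print("t-SNE: Sample size (Number of nodes)={size}".format(size=sample_size))
        print("t-SNE: perplexity={perplexity}".format(perplexity=clamp_perplexity(sample_size)))
```

If very small graphs turn out to be a problem after the migration, the returned value could be passed back in as `TSNE(n_components=2, perplexity=..., ...)`, as the pre-migration code did.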
@@ -365,7 +358,9 @@
 "source": [
 "### 1.1 Generate Node Embeddings using Fast Random Projection (Fast RP) for Java Packages\n",
 "\n",
-"[Fast Random Projection](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp) is used to reduce the dimensionality of the node feature space while preserving most of the distance information. Nodes with similar neighborhood result in node embedding with similar vectors."
+"[Fast Random Projection](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp) is used to reduce the dimensionality of the node feature space while preserving most of the distance information. Nodes with similar neighborhood result in node embedding with similar vectors.\n",
+"\n",
+"**👉Hint:** To skip existing node embeddings and always calculate them based on the parameters below edit `Node_Embeddings_0a_Query_Calculated` so that it won't return any results."
 ]
 },
 {
@@ -511,7 +506,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.11.4"
+"version": "3.11.9"
 },
 "title": "Object Oriented Design Quality Metrics for Java with Neo4j"
 },

jupyter/NodeEmbeddingsTypescript.ipynb

Lines changed: 10 additions & 15 deletions
@@ -58,7 +58,7 @@
 "import matplotlib.pyplot as plot\n",
 "import typing as typ\n",
 "import numpy as np\n",
-"from sklearn.manifold import TSNE\n",
+"from openTSNE.sklearn import TSNE\n",
 "from neo4j import GraphDatabase"
 ]
 },
@@ -69,8 +69,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"import sklearn\n",
-"print('The scikit-learn version is {}.'.format(sklearn.__version__))\n",
+"from openTSNE import __version__ as openTSNE_version\n",
+"print('The openTSNE version is: {}'.format(openTSNE_version))\n",
 "print('The pandas version is {}.'.format(pd.__version__))\n"
 ]
 },
@@ -231,7 +231,7 @@
 "\n",
 "> It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.\n",
 "\n",
-"(see https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE)"
+"(see https://opentsne.readthedocs.io)"
 ]
 },
 {
@@ -245,7 +245,7 @@
 " \"\"\"\n",
 " Reduces the dimensionality of the node embeddings (e.g. 32 floating point numbers in an array)\n",
 " to two dimensions for 2D visualization.\n",
-" see https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE\n",
+" see https://opentsne.readthedocs.io\n",
 " \"\"\"\n",
 "\n",
 " if embeddings.empty: \n",
@@ -258,16 +258,9 @@
 " # See https://bobbyhadz.com/blog/python-attributeerror-list-object-has-no-attribute-shape\n",
 " embeddings_as_numpy_array = np.array(embeddings.embedding.to_list())\n",
 "\n",
-" # The parameter \"perplexity\" needs to be smaller than the sample size\n",
-" # See https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html\n",
-" number_of_nodes=embeddings.shape[0]\n",
-" perplexity = min(number_of_nodes - 1.0, 30.0)\n",
-" print(\"t-SNE: Sample size (Number of nodes)={size}\".format(size = number_of_nodes))\n",
-" print(\"t-SNE: perplexity={perplexity}\".format(perplexity=perplexity))\n",
-"\n",
 " # Use t-distributed stochastic neighbor embedding (t-SNE) to reduce the dimensionality \n",
 " # of the previously calculated node embeddings to 2 dimensions for visualization\n",
-" t_distributed_stochastic_neighbor_embedding = TSNE(n_components=2, perplexity=perplexity, verbose=1, random_state=50)\n",
+" t_distributed_stochastic_neighbor_embedding = TSNE(n_components=2, verbose=1, random_state=47)\n",
 " two_dimension_node_embeddings = t_distributed_stochastic_neighbor_embedding.fit_transform(embeddings_as_numpy_array)\n",
 " display(two_dimension_node_embeddings.shape) # Display the shape of the t-SNE result\n",
 "\n",
@@ -365,7 +358,9 @@
 "source": [
 "### 1.1 Generate Node Embeddings for Typescript Modules using Fast Random Projection (Fast RP)\n",
 "\n",
-"[Fast Random Projection](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp) is used to reduce the dimensionality of the node feature space while preserving most of the distance information. Nodes with similar neighborhood result in node embedding with similar vectors."
+"[Fast Random Projection](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp) is used to reduce the dimensionality of the node feature space while preserving most of the distance information. Nodes with similar neighborhood result in node embedding with similar vectors.\n",
+"\n",
+"**👉 Hint:** To skip existing node embeddings and always calculate them based on the parameters below edit `Node_Embeddings_0a_Query_Calculated` so that it won't return any results."
 ]
 },
 {
@@ -514,7 +509,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.11.4"
+"version": "3.11.9"
 },
 "title": "Object Oriented Design Quality Metrics for Java with Neo4j"
 },

jupyter/environment.yml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ dependencies:
 - numpy=1.23.*
 - pandas=1.5.*
 - pip=22.3.*
-- scikit-learn=1.3.* # NodeEmbeddings.ipynb uses sklearn.manifold.TSNE
+- opentsne=1.0.* # to visualize node embeddings in 2D (t-SNE dimensionality reduction)
 - pip:
 - monotonic==1.*
 - wordcloud==1.9.*