Commit edc05a7

Merge pull request #157 from JohT/feature/import-git-log-into-the-graph
Provide script to import git log as csv
2 parents dcd9c29 + de72cba commit edc05a7

27 files changed

+533
-59
lines changed

.github/workflows/java-code-analysis.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -127,6 +127,7 @@ jobs:
127127
env:
128128
NEO4J_INITIAL_PASSWORD: ${{ secrets.NEO4J_INITIAL_PASSWORD }}
129129
ENABLE_JUPYTER_NOTEBOOK_PDF_GENERATION: "true"
130+
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT: "full" # Options: "none", "aggregated", "full"
130131
run: |
131132
./../../scripts/analysis/analyze.sh
132133

.github/workflows/typescript-code-analysis.yml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -132,6 +132,7 @@ jobs:
132132
env:
133133
NEO4J_INITIAL_PASSWORD: ${{ secrets.NEO4J_INITIAL_PASSWORD }}
134134
ENABLE_JUPYTER_NOTEBOOK_PDF_GENERATION: "true"
135+
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT: "full" # Options: "none", "aggregated", "full"
135136
run: |
136137
./../../scripts/analysis/analyze.sh
137138

COMMANDS.md

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,7 @@
99
- [Start an analysis with CSV reports only](#start-an-analysis-with-csv-reports-only)
1010
- [Start an analysis with Jupyter reports only](#start-an-analysis-with-jupyter-reports-only)
1111
- [Start an analysis with PDF generation](#start-an-analysis-with-pdf-generation)
12+
- [Start an analysis without importing git log data](#start-an-analysis-without-importing-git-log-data)
1213
- [Only run setup and explore the Graph manually](#only-run-setup-and-explore-the-graph-manually)
1314
- [Generate Markdown References](#generate-markdown-references)
1415
- [Generate Cypher Reference](#generate-cypher-reference)
@@ -24,6 +25,10 @@
2425
- [Setup jQAssistant Java Code Analyzer](#setup-jqassistant-java-code-analyzer)
2526
- [Download Maven Artifacts to analyze](#download-maven-artifacts-to-analyze)
2627
- [Reset the database and scan the java artifacts](#reset-the-database-and-scan-the-java-artifacts)
28+
- [Import git log](#import-git-log)
29+
- [Parameters](#parameters)
30+
- [Resolving git files to code files](#resolving-git-files-to-code-files)
31+
- [Import aggregated git log](#import-aggregated-git-log)
2732
- [Database Queries](#database-queries)
2833
- [Cypher Shell](#cypher-shell)
2934
- [HTTP API](#http-api)
@@ -100,6 +105,14 @@ Note: Generating a PDF from a Jupyter notebook using [nbconvert](https://nbconve
100105
ENABLE_JUPYTER_NOTEBOOK_PDF_GENERATION=true ./../../scripts/analysis/analyze.sh
101106
```
102107

108+
#### Start an analysis without importing git log data
109+
110+
To speed up the analysis and reduce the data footprint, you can switch off the git log data import of the "source" directory (if present) with `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none"` as shown below. Alternatively, choose `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated"` to reduce data size by importing only monthly grouped changes instead of all commits.
111+
112+
```shell
113+
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none" ./../../scripts/analysis/analyze.sh
114+
```
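
The three documented options could be dispatched roughly as in the following minimal sketch. The function name `select_git_log_import_script` is hypothetical, and the mapping of `full` and `aggregated` to the two import scripts is an assumption based on their descriptions, not a copy of what `analyze.sh` does.

```shell
# Hypothetical helper (not part of the repository):
# prints the import script assumed to belong to the given mode.
select_git_log_import_script() {
  case "$1" in
    none) : ;;                                      # skip the git log import, print nothing
    aggregated) echo "importAggregatedGitLog.sh" ;; # monthly grouped changes
    full) echo "importGitLog.sh" ;;                 # every single commit
    *) echo "Unsupported git log import mode: $1" >&2; return 1 ;;
  esac
}
```

Calling it with `"$IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT"` as its only argument would then print the script to run, or nothing for `none`.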
115+
103116
#### Only run setup and explore the Graph manually
104117

105118
To prepare everything for analysis including installation, configuration and preparation queries to explore the graph manually
@@ -214,6 +227,35 @@ enhance the data further with relationships between artifacts and packages.
214227

215228
Be aware that this script deletes all previous relationships and nodes in the local Neo4j Graph database.
216229

230+
### Import git log
231+
232+
Use [importGitLog.sh](./scripts/importGitLog.sh) to import git log data into the Graph.
233+
It uses `git log` to extract commits, their authors and the names of the files changed with them. These are stored in an intermediate CSV file and are then imported into Neo4j with the following schema:
234+
235+
```Cypher
236+
(Git:Log:Author)-[:AUTHORED]->(Git:Log:Commit)-[:CONTAINS]->(Git:Log:File)
237+
```
238+
239+
👉**Note:** Commit messages containing `[bot]` are filtered out to ignore changes made by bots.
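
As an illustration of the bot filter, the sketch below drops rows from a CSV stream that contain the literal text `[bot]`. The function name `filter_bot_rows` and the column layout in the usage example are assumptions. The sketch checks the whole row, which is broader than checking only the commit message.

```shell
# Hypothetical filter (not part of the repository):
# reads CSV rows from stdin and drops every row that contains "[bot]".
filter_bot_rows() {
  # -F treats "[bot]" as a literal string instead of a bracket expression.
  # "|| true" keeps the exit status at 0 when every row was filtered out.
  grep -vF '[bot]' || true
}
```

For example, `printf '%s\n' 'a1,Alice,Fix bug,src/A.java' 'b2,Renovate,Update lib [bot],pom.xml' | filter_bot_rows` keeps only the first row.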
240+
241+
#### Parameters
242+
243+
The optional parameter `--repository directory-path-to-a-git-repository` can be used to select a different directory for the repository. By default, the `source` directory within the analysis workspace directory is used. This command only needs the git history to be present so a `git clone --bare` is sufficient. If the `source` directory is also used for the analysis then a full git clone is of course needed (like for Typescript).
244+
245+
#### Resolving git files to code files
246+
247+
After git log data has been imported successfully, [Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher](./cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher) is used to try to resolve the imported git file names to code files. This first attempt will cover most cases, but not all of them. With this approach it is, for example, not possible to distinguish identical file names in different Java jars from the git source files of a mono repo.
248+
249+
You can use [List_unresolved_git_files.cypher](./cypher/GitLog/List_unresolved_git_files.cypher) to find code files that couldn't be matched to git file names and [List_ambiguous_git_files.cypher](./cypher/GitLog/List_ambiguous_git_files.cypher) to find ambiguously resolved git files. If you have any ideas on how to improve this, feel free to [open an issue](https://github.com/JohT/code-graph-analysis-pipeline/issues/new).
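
The name comparison behind the Java resolution can be sketched in shell. The Cypher query for Java in this commit replaces `.class` with `.java` in the code file name and then checks whether the git file name ends with the result. The function name `git_file_resolves_to` and the example paths are hypothetical.

```shell
# Hypothetical helper (not part of the repository):
# succeeds if the git file name ends with the code file name
# after ".class" has been replaced by ".java".
git_file_resolves_to() {
  git_file_name="$1"
  code_file_name="$(printf '%s' "$2" | sed 's/\.class/.java/g')"
  case "$git_file_name" in
    *"$code_file_name") return 0 ;;  # quoted pattern part is matched literally
    *) return 1 ;;
  esac
}
```

With `/org/example/Foo.class` as the code file, both `src/main/java/org/example/Foo.java` and `module-b/src/main/java/org/example/Foo.java` match, which shows how identical names in a mono repo lead to ambiguous results.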
250+
251+
### Import aggregated git log
252+
253+
Use [importAggregatedGitLog.sh](./scripts/importAggregatedGitLog.sh) to import git log data in an aggregated form into the Graph. It works similarly to the [full git log version above](#import-git-log). The only difference is that not every single commit is imported. Instead, changes are grouped per month including their commit count. In many cases this is sufficient, and it reduces data size and processing time significantly. Here is the resulting schema:
254+
255+
```Cypher
256+
(Git:Log:Author)-[:AUTHORED]->(Git:Log:ChangeSpan)-[:CONTAINS]->(Git:Log:File)
257+
```
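
The monthly grouping can be illustrated with a small sketch. It assumes a simplified, hypothetical CSV layout of `author,date,file` with ISO dates and one row per commit and file. The real script may use different columns and a different grouping key.

```shell
# Hypothetical aggregation (not part of the repository):
# groups rows by author, month (YYYY-MM) and file and appends the row count,
# which equals the commit count when there is one row per commit and file.
aggregate_changes_by_month() {
  awk -F',' '
    { key = $1 "," substr($2, 1, 7) "," $3; count[key]++ }
    END { for (key in count) print key "," count[key] }
  ' | sort  # awk iterates in unspecified order, so sort for stable output
}
```

Two rows for `alice` and `src/A.java` dated `2024-01-03` and `2024-01-20` would collapse into the single row `alice,2024-01,src/A.java,2`.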
258+
217259
## Database Queries
218260

219261
### Cypher Shell

GETTING_STARTED.md

Lines changed: 10 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,7 @@ Please read through the [Prerequisites](./README.md#hammer_and_wrench-prerequisi
2424
cd MyFirstAnalysis
2525
```
2626

27-
1. Choose an initial password for Neo4j
27+
1. Choose an initial password for Neo4j if not already done
2828

2929
```shell
3030
export NEO4J_INITIAL_PASSWORD=theinitialpasswordthatihavechosenforneo4j
@@ -36,9 +36,11 @@ Please read through the [Prerequisites](./README.md#hammer_and_wrench-prerequisi
3636
mkdir artifacts
3737
```
3838

39-
1. Move the artifacts you want to analyze into the `artifacts` directory
39+
1. Move the artifacts (Java jar or Typescript analysis json files) you want to analyze into the `artifacts` directory
4040

41-
1. Optionally run a predefined script to download artifacts
41+
1. Optionally, create a `source` directory and clone the corresponding source code into it to also gather git log data.
42+
43+
1. Alternatively to the steps above, run an already predefined download script
4244

4345
```shell
4446
./../../scripts/downloader/downloadAxonFramework.sh <version>
@@ -48,31 +50,31 @@ Please read through the [Prerequisites](./README.md#hammer_and_wrench-prerequisi
4850

4951
1. Start the analysis
5052

51-
- Without any additional dependencies:
53+
- Without any additional dependencies:
5254

5355
```shell
5456
./../../scripts/analysis/analyze.sh --report Csv
5557
```
5658

57-
- Jupyter notebook reports when Python and Conda are installed:
59+
- Jupyter notebook reports when Python and Conda are installed:
5860

5961
```shell
6062
./../../scripts/analysis/analyze.sh --report Jupyter
6163
```
6264

63-
- Graph visualizations when Node.js and npm are installed:
65+
- Graph visualizations when Node.js and npm are installed:
6466

6567
```shell
6668
./../../scripts/analysis/analyze.sh --report Jupyter
6769
```
6870

69-
- All reports with Python, Conda, Node.js and npm installed:
71+
- All reports with Python, Conda, Node.js and npm installed:
7072

7173
```shell
7274
./../../scripts/analysis/analyze.sh
7375
```
7476

75-
- To explore the database yourself without any automatically generated reports and no additional requirements:
77+
- To explore the database yourself without any automatically generated reports and no additional requirements:
7678

7779
```shell
7880
./../../scripts/analysis/analyze.sh --explore

README.md

Lines changed: 24 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -91,7 +91,9 @@ This could be as simple as running the following command in your Typescript proj
9191
npx --yes @jqassistant/ts-lce
9292
```
9393

94-
- Copy the resulting json file (e.g. `.reports/jqa/ts-output.json`) into the "artifacts" directory for your analysis work directory. Custom subdirectories within "artifacts" are also supported.
94+
- It is recommended to put the cloned source code repository into a directory called `source` within the analysis workspace so that it will also be picked up to import git log data.
95+
96+
- Copy the resulting json file (e.g. `.reports/jqa/ts-output.json`) into the `artifacts` directory for your analysis work directory. Custom subdirectories within `artifacts` are also supported.
9597

9698
## :rocket: Getting Started
9799

@@ -105,7 +107,7 @@ The [Code Structure Analysis Pipeline](./.github/workflows/java-code-analysis.ym
105107
- [Checkout GIT Repository](https://github.com/actions/checkout)
106108
- [Setup Java](https://github.com/actions/setup-java)
107109
- [Setup Python with Conda](https://github.com/conda-incubator/setup-miniconda) package manager [Mambaforge](https://github.com/conda-forge/miniforge#mambaforge)
108-
- Download artifacts that contain the code to be analyzed [scripts/artifacts](./scripts/downloader/)
110+
- Download artifacts and optionally source code that contain the code to be analyzed [scripts/downloader](./scripts/downloader)
109111
- Setup [Neo4j](https://neo4j.com) Graph Database ([analysis.sh](./scripts/analysis/analyze.sh))
110112
- Setup [jQAssistant](https://jqassistant.github.io/jqassistant/doc) for Java and [Typescript](https://github.com/jqassistant-plugin/jqassistant-typescript-plugin) analysis ([analysis.sh](./scripts/analysis/analyze.sh))
111113
- Start [Neo4j](https://neo4j.com) Graph Database ([analysis.sh](./scripts/analysis/analyze.sh))
@@ -176,7 +178,7 @@ The [Code Structure Analysis Pipeline](./.github/workflows/java-code-analysis.ym
176178
👉 The script will automatically be included because of the directory and its name ending with "Jupyter.sh".
177179

178180
- How can I add another code base to be analyzed automatically?
179-
👉 Create a new artifacts download script in the [scripts/downloader](./scripts/downloader/) directory. Take for example [downloadAxonFramework.sh](./scripts/downloader/downloadAxonFramework.sh) as a reference.
181+
👉 Create a new download script in the [scripts/downloader](./scripts/downloader/) directory. Take for example [downloadAxonFramework.sh](./scripts/downloader/downloadAxonFramework.sh) as a reference.
180182
👉 Run the script separately before executing [analyze.sh](./scripts/analysis/analyze.sh) also in the [pipeline](./.github/workflows/java-code-analysis.yml).
181183

182184
- How can I trigger a full re-scan of all artifacts?
@@ -195,6 +197,25 @@ The [Code Structure Analysis Pipeline](./.github/workflows/java-code-analysis.ym
195197
ENABLE_JUPYTER_NOTEBOOK_PDF_GENERATION=true ./../../scripts/analysis/analyze.sh
196198
```
197199

200+
- How can I disable git log data import?
201+
👉 Set environment variable `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT` to `none`. Example:
202+
203+
```shell
204+
export IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none"
205+
```
206+
207+
👉 Alternatively prepend your command with `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none"`:
208+
209+
```shell
210+
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none" ./../../scripts/analysis/analyze.sh
211+
```
212+
213+
👉 An in-between option would be to only import monthly aggregated changes using `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated"`:
214+
215+
```shell
216+
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" ./../../scripts/analysis/analyze.sh
217+
```
218+
198219
- Why are some Jupyter Notebook reports skipped?
199220
👉 The custom Jupyter Notebook metadata property `code_graph_analysis_pipeline_data_validation` can be set to choose a query from [cypher/Validation](./cypher/Validation) that will be executed before the notebook runs. If the query returns at least one result, the validation succeeds and the notebook will be run. If the query returns no result, the notebook will be skipped.
200221
For more details see [Data Availability Validation](./COMMANDS.md#data-availability-validation).

cypher/Create_a_DEPENDS_ON_relationship_for_every_DEPENDS_ON_ARTIFACT.cypher

Lines changed: 0 additions & 8 deletions
This file was deleted.

cypher/Create_a_DEPENDS_ON_relationship_for_every_DEPENDS_ON_PACKAGE.cypher

Lines changed: 0 additions & 7 deletions
This file was deleted.
Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
11
// List external Java types used
22

3-
MATCH (external:Java:ExternalType) RETURN external.fqn
3+
MATCH (external:Java:ExternalType)
4+
RETURN labels(external), count(DISTINCT external.fqn) as numberOfExternalTypes
Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
// Connect git files to code files with a RESOLVES_TO relationship if their names match
2+
// Note: Even if it is tempting to combine this file with the Typescript variant, they are intentionally separated.
3+
// The differences are subtle but need to be thought through and tested carefully.
4+
// Having separate files makes it obvious that there needs to be one for every new source code language.
5+
6+
MATCH (code_file:File&!Git)
7+
WHERE NOT EXISTS { (code_file)-[:RESOLVES_TO]->(other_file:File&!Git) } // only original nodes, no duplicates
8+
WITH code_file, replace(code_file.fileName, '.class', '.java') AS codeFileName
9+
MATCH (git_file:File&Git)
10+
WHERE git_file.fileName ENDS WITH codeFileName
11+
MERGE (git_file)-[:RESOLVES_TO]->(code_file)
12+
SET git_file.resolved = true
13+
RETURN labels(code_file)[0..4] AS codeFileLabels
14+
,count(DISTINCT codeFileName) AS numberOfCodeFiles
15+
,collect(DISTINCT codeFileName + ' <-> ' + git_file.fileName + '\n')[0..4] AS examples
Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
// Connect git files to Typescript files with a RESOLVES_TO relationship if their names match
2+
// Note: Even if it is tempting to combine this file with the Java variant, they are intentionally separated.
3+
// The differences are subtle but need to be thought through and tested carefully.
4+
// Having separate files makes it obvious that there needs to be one for every new source code language.
5+
6+
MATCH (code_file:File&!Git)
7+
WHERE NOT EXISTS { (code_file)-[:RESOLVES_TO]->(other_file:File&!Git) } // only original nodes, no duplicates
8+
WITH code_file, code_file.absoluteFileName AS codeFileName
9+
MATCH (git_file:File&Git)
10+
WHERE codeFileName ENDS WITH git_file.fileName
11+
MERGE (git_file)-[:RESOLVES_TO]->(code_file)
12+
SET git_file.resolved = true
13+
RETURN labels(code_file)[0..4] AS codeFileLabels
14+
,count(DISTINCT codeFileName) AS numberOfCodeFiles
15+
,collect(DISTINCT codeFileName + ' <-> ' + git_file.fileName + '\n')[0..4] AS examples
