1 change: 1 addition & 0 deletions .github/workflows/java-code-analysis.yml
Original file line number Diff line number Diff line change
@@ -127,6 +127,7 @@ jobs:
env:
NEO4J_INITIAL_PASSWORD: ${{ secrets.NEO4J_INITIAL_PASSWORD }}
ENABLE_JUPYTER_NOTEBOOK_PDF_GENERATION: "true"
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT: "full" # Options: "none", "aggregated", "full"
run: |
./../../scripts/analysis/analyze.sh

1 change: 1 addition & 0 deletions .github/workflows/typescript-code-analysis.yml
@@ -132,6 +132,7 @@ jobs:
env:
NEO4J_INITIAL_PASSWORD: ${{ secrets.NEO4J_INITIAL_PASSWORD }}
ENABLE_JUPYTER_NOTEBOOK_PDF_GENERATION: "true"
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT: "full" # Options: "none", "aggregated", "full"
run: |
./../../scripts/analysis/analyze.sh

42 changes: 42 additions & 0 deletions COMMANDS.md
@@ -9,6 +9,7 @@
- [Start an analysis with CSV reports only](#start-an-analysis-with-csv-reports-only)
- [Start an analysis with Jupyter reports only](#start-an-analysis-with-jupyter-reports-only)
- [Start an analysis with PDF generation](#start-an-analysis-with-pdf-generation)
- [Start an analysis without importing git log data](#start-an-analysis-without-importing-git-log-data)
- [Only run setup and explore the Graph manually](#only-run-setup-and-explore-the-graph-manually)
- [Generate Markdown References](#generate-markdown-references)
- [Generate Cypher Reference](#generate-cypher-reference)
@@ -24,6 +25,10 @@
- [Setup jQAssistant Java Code Analyzer](#setup-jqassistant-java-code-analyzer)
- [Download Maven Artifacts to analyze](#download-maven-artifacts-to-analyze)
- [Reset the database and scan the java artifacts](#reset-the-database-and-scan-the-java-artifacts)
- [Import git log](#import-git-log)
- [Parameters](#parameters)
- [Resolving git files to code files](#resolving-git-files-to-code-files)
- [Import aggregated git log](#import-aggregated-git-log)
- [Database Queries](#database-queries)
- [Cypher Shell](#cypher-shell)
- [HTTP API](#http-api)
@@ -100,6 +105,14 @@ Note: Generating a PDF from a Jupyter notebook using [nbconvert](https://nbconve
ENABLE_JUPYTER_NOTEBOOK_PDF_GENERATION=true ./../../scripts/analysis/analyze.sh
```

#### Start an analysis without importing git log data

To speed up the analysis and reduce the data footprint, you can switch off the git log data import of the `source` directory (if present) with `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none"` as shown below. Alternatively, choose `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated"` to reduce the data size by importing monthly grouped changes instead of every single commit.

```shell
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none" ./../../scripts/analysis/analyze.sh
```

#### Only run setup and explore the Graph manually

To prepare everything for analysis including installation, configuration and preparation queries to explore the graph manually
@@ -214,6 +227,35 @@ enhance the data further with relationships between artifacts and packages.

Be aware that this script deletes all previous relationships and nodes in the local Neo4j Graph database.

### Import git log

Use [importGitLog.sh](./scripts/importGitLog.sh) to import git log data into the Graph.
It uses `git log` to extract commits, their authors and the names of the files changed with them. These are stored in an intermediate CSV file and are then imported into Neo4j with the following schema:

```Cypher
(:Git:Log:Author)-[:AUTHORED]->(:Git:Log:Commit)-[:CONTAINS]->(:Git:Log:File)
```

👉**Note:** Commit messages containing `[bot]` are filtered out to ignore changes made by bots.
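The extraction itself can be sketched with plain `git` commands. The following is a hedged sketch using a throwaway repository; the format string and the CSV post-processing of the actual `importGitLog.sh` may differ:

```shell
# Create a small throwaway repository to demonstrate the extraction.
repository=$(mktemp -d)
git -C "$repository" init --quiet
echo "hello" > "$repository/README.md"
git -C "$repository" add README.md
git -C "$repository" -c user.name="Jane Doe" -c user.email="[email protected]" \
    commit --quiet --message "Initial commit"

# List hash, author, email and ISO timestamp per commit, followed by the
# names of the changed files (assumed format; the real script may differ).
git -C "$repository" log --no-merges --name-only \
    --pretty=format:'%H;%an;%ae;%aI' > gitlog.txt
cat gitlog.txt
```

Each commit produces one header line plus the list of changed file names, which maps naturally onto the `Author`, `Commit` and `File` nodes of the schema above.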

#### Parameters

The optional parameter `--repository directory-path-to-a-git-repository` can be used to select a different directory for the repository. By default, the `source` directory within the analysis workspace directory is used. The command only needs the git history to be present, so a `git clone --bare` is sufficient. If the `source` directory is also used for the analysis itself (as it is for Typescript), a full git clone is needed.
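As a hedged sketch of the bare-clone workflow (local paths only; the `importGitLog.sh` invocation is shown as a comment since its relative location depends on your workspace):

```shell
# A bare clone contains only the git history, which is all the import needs.
original=$(mktemp -d)
git -C "$original" init --quiet
git -C "$original" -c user.name="demo" -c user.email="[email protected]" \
    commit --quiet --allow-empty --message "demo commit"

git clone --quiet --bare "$original" source-history

# The history is available even though there is no working tree:
git -C source-history log --oneline > history.txt
cat history.txt

# Hypothetical invocation from within an analysis workspace directory:
# ./../../scripts/importGitLog.sh --repository "$(pwd)/source-history"
```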

#### Resolving git files to code files

After git log data has been imported successfully, [Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher](./cypher/GitLog/Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher) is used to resolve the imported git file names to code files. This first attempt covers most cases, but not all of them: with this approach it is, for example, not possible to distinguish identical file names in different Java jars from the git source files of a monorepo.

You can use [List_unresolved_git_files.cypher](./cypher/GitLog/List_unresolved_git_files.cypher) to find code files that couldn't be matched to git file names and [List_ambiguous_git_files.cypher](./cypher/GitLog/List_ambiguous_git_files.cypher) to find ambiguously resolved git files. If you have any idea on how to improve this, feel free to [open an issue](https://github.com/JohT/code-graph-analysis-pipeline/issues/new).

### Import aggregated git log

Use [importAggregatedGitLog.sh](./scripts/importAggregatedGitLog.sh) to import git log data in an aggregated form into the Graph. It works similarly to the [full git log version above](#import-git-log). The only difference is that not every single commit is imported. Instead, changes are grouped per month, including their commit count. This is sufficient in many cases and reduces data size and processing time significantly. Here is the resulting schema:

```Cypher
(:Git:Log:Author)-[:AUTHORED]->(:Git:Log:ChangeSpan)-[:CONTAINS]->(:Git:Log:File)
```
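The grouping idea can be illustrated with plain shell tools. This is only a sketch under the assumption that aggregation happens per author and month; the actual `importAggregatedGitLog.sh` may differ:

```shell
# Throwaway repository with three commits by the same author.
repository=$(mktemp -d)
git -C "$repository" init --quiet
for message in first second third; do
    git -C "$repository" -c user.name="demo" -c user.email="[email protected]" \
        commit --quiet --allow-empty --message "$message"
done

# Count commits per author and month (YYYY-MM) instead of listing each one:
git -C "$repository" log --pretty=format:'%an %ad' --date=format:'%Y-%m' \
    | sort | uniq -c > aggregated.txt
cat aggregated.txt
```

The count produced by `uniq -c` corresponds to the `commits` property of a `ChangeSpan` node in the schema above.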

## Database Queries

### Cypher Shell
18 changes: 10 additions & 8 deletions GETTING_STARTED.md
@@ -24,7 +24,7 @@ Please read through the [Prerequisites](./README.md#hammer_and_wrench-prerequisi
cd MyFirstAnalysis
```

1. Choose an initial password for Neo4j
1. Choose an initial password for Neo4j if not already done

```shell
export NEO4J_INITIAL_PASSWORD=theinitialpasswordthatihavechosenforneo4j
@@ -36,9 +36,11 @@ Please read through the [Prerequisites](./README.md#hammer_and_wrench-prerequisi
mkdir artifacts
```

1. Move the artifacts you want to analyze into the `artifacts` directory
1. Move the artifacts (Java jars or Typescript analysis JSON files) you want to analyze into the `artifacts` directory

1. Optionally run a predefined script to download artifacts
1. Optionally, create a `source` directory and clone the corresponding source code into it to also gather git log data.

1. Alternatively to the steps above, run a predefined download script

```shell
./../../scripts/downloader/downloadAxonFramework.sh <version>
@@ -48,31 +50,31 @@

1. Start the analysis

- Without any additional dependencies:
- Without any additional dependencies:

```shell
./../../scripts/analysis/analyze.sh --report Csv
```

- Jupyter notebook reports when Python and Conda are installed:
- Jupyter notebook reports when Python and Conda are installed:

```shell
./../../scripts/analysis/analyze.sh --report Jupyter
```

- Graph visualizations when Node.js and npm are installed:
- Graph visualizations when Node.js and npm are installed:

```shell
./../../scripts/analysis/analyze.sh --report Jupyter
```

- All reports with Python, Conda, Node.js and npm installed:
- All reports with Python, Conda, Node.js and npm installed:

```shell
./../../scripts/analysis/analyze.sh
```

- To explore the database yourself without any automatically generated reports and no additional requirements:
- To explore the database yourself without any automatically generated reports and no additional requirements:

```shell
./../../scripts/analysis/analyze.sh --explore
27 changes: 24 additions & 3 deletions README.md
@@ -91,7 +91,9 @@ This could be as simple as running the following command in your Typescript proj
npx --yes @jqassistant/ts-lce
```

- Copy the resulting json file (e.g. `.reports/jqa/ts-output.json`) into the "artifacts" directory for your analysis work directory. Custom subdirectories within "artifacts" are also supported.
- It is recommended to put the cloned source code repository into a directory called `source` within the analysis workspace so that it will also be picked up to import git log data.

- Copy the resulting JSON file (e.g. `.reports/jqa/ts-output.json`) into the `artifacts` directory of your analysis workspace. Custom subdirectories within `artifacts` are also supported.
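Put together, a minimal workspace preparation might look like this. The paths and the placeholder file are illustrative only; the placeholder stands in for the real jQAssistant Typescript output:

```shell
# Illustrative workspace layout with an artifacts and an optional source directory.
mkdir -p MyFirstAnalysis/artifacts
mkdir -p MyFirstAnalysis/source   # optional: cloned sources for git log import
echo '{}' > ts-output.json        # placeholder for e.g. .reports/jqa/ts-output.json
cp ts-output.json MyFirstAnalysis/artifacts/
ls MyFirstAnalysis/artifacts
```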

## :rocket: Getting Started

@@ -105,7 +107,7 @@ The [Code Structure Analysis Pipeline](./.github/workflows/java-code-analysis.ym
- [Checkout GIT Repository](https://github.com/actions/checkout)
- [Setup Java](https://github.com/actions/setup-java)
- [Setup Python with Conda](https://github.com/conda-incubator/setup-miniconda) package manager [Mambaforge](https://github.com/conda-forge/miniforge#mambaforge)
- Download artifacts that contain the code to be analyzed [scripts/artifacts](./scripts/downloader/)
- Download artifacts and optionally source code that contain the code to be analyzed [scripts/downloader](./scripts/downloader)
- Setup [Neo4j](https://neo4j.com) Graph Database ([analysis.sh](./scripts/analysis/analyze.sh))
- Setup [jQAssistant](https://jqassistant.github.io/jqassistant/doc) for Java and [Typescript](https://github.com/jqassistant-plugin/jqassistant-typescript-plugin) analysis ([analysis.sh](./scripts/analysis/analyze.sh))
- Start [Neo4j](https://neo4j.com) Graph Database ([analysis.sh](./scripts/analysis/analyze.sh))
@@ -176,7 +178,7 @@ The [Code Structure Analysis Pipeline](./.github/workflows/java-code-analysis.ym
👉 The script will automatically be included because of the directory and its name ending with "Jupyter.sh".

- How can I add another code base to be analyzed automatically?
👉 Create a new artifacts download script in the [scripts/downloader](./scripts/downloader/) directory. Take for example [downloadAxonFramework.sh](./scripts/downloader/downloadAxonFramework.sh) as a reference.
👉 Create a new download script in the [scripts/downloader](./scripts/downloader/) directory. Take for example [downloadAxonFramework.sh](./scripts/downloader/downloadAxonFramework.sh) as a reference.
👉 Run the script separately before executing [analyze.sh](./scripts/analysis/analyze.sh) also in the [pipeline](./.github/workflows/java-code-analysis.yml).

- How can I trigger a full re-scan of all artifacts?
@@ -195,6 +197,25 @@ The [Code Structure Analysis Pipeline](./.github/workflows/java-code-analysis.ym
ENABLE_JUPYTER_NOTEBOOK_PDF_GENERATION=true ./../../scripts/analysis/analyze.sh
```

- How can I disable the git log data import?
  👉 Set the environment variable `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT` to `none`. Example:

```shell
export IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none"
```

👉 Alternatively, prepend your command with `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none"`:

```shell
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none" ./../../scripts/analysis/analyze.sh
```

👉 An in-between option is to import only monthly aggregated changes using `IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated"`:

```shell
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" ./../../scripts/analysis/analyze.sh
```

- Why are some Jupyter Notebook reports skipped?
👉 The custom Jupyter Notebook metadata property `code_graph_analysis_pipeline_data_validation` can be set to choose a query from [cypher/Validation](./cypher/Validation) that will be executed preliminary to the notebook. If the query leads to at least one result, the validation succeeds and the notebook will be run. If the query leads to no result, the notebook will be skipped.
For more details see [Data Availability Validation](./COMMANDS.md#data-availability-validation).

This file was deleted.

This file was deleted.

@@ -1,3 +1,4 @@
// List external Java types used

MATCH (external:Java:ExternalType) RETURN external.fqn
MATCH (external:Java:ExternalType)
RETURN labels(external), count(DISTINCT external.fqn) as numberOfExternalTypes
@@ -0,0 +1,15 @@
// Connect git files to code files with a RESOLVES_TO relationship if their names match
// Note: Even though it is tempting to combine this file with the Typescript variant, they are intentionally separated.
// The differences are subtle but need to be thought through and tested carefully.
// Having separate files makes it obvious that there needs to be one for every new source code language.

MATCH (code_file:File&!Git)
WHERE NOT EXISTS { (code_file)-[:RESOLVES_TO]->(other_file:File&!Git) } // only original nodes, no duplicates
WITH code_file, replace(code_file.fileName, '.class', '.java') AS codeFileName
MATCH (git_file:File&Git)
WHERE git_file.fileName ENDS WITH codeFileName
MERGE (git_file)-[:RESOLVES_TO]->(code_file)
SET git_file.resolved = true
RETURN labels(code_file)[0..4] AS codeFileLabels
,count(DISTINCT codeFileName) AS numberOfCodeFiles
,collect(DISTINCT codeFileName + ' <-> ' + git_file.fileName + '\n')[0..4] AS examples
@@ -0,0 +1,15 @@
// Connect git files to Typescript files with a RESOLVES_TO relationship if their names match
// Note: Even though it is tempting to combine this file with the Java variant, they are intentionally separated.
// The differences are subtle but need to be thought through and tested carefully.
// Having separate files makes it obvious that there needs to be one for every new source code language.

MATCH (code_file:File&!Git)
WHERE NOT EXISTS { (code_file)-[:RESOLVES_TO]->(other_file:File&!Git) } // only original nodes, no duplicates
WITH code_file, code_file.absoluteFileName AS codeFileName
MATCH (git_file:File&Git)
WHERE codeFileName ENDS WITH git_file.fileName
MERGE (git_file)-[:RESOLVES_TO]->(code_file)
SET git_file.resolved = true
RETURN labels(code_file)[0..4] AS codeFileLabels
,count(DISTINCT codeFileName) AS numberOfCodeFiles
,collect(DISTINCT codeFileName + ' <-> ' + git_file.fileName + '\n')[0..4] AS examples
7 changes: 7 additions & 0 deletions cypher/GitLog/Delete_git_log_data.cypher
@@ -0,0 +1,7 @@
// Delete all Git log data in the Graph

MATCH (n:Git)
CALL { WITH n
DETACH DELETE n
} IN TRANSACTIONS OF 1000 ROWS
RETURN count(n) as numberOfDeletedRows
17 changes: 17 additions & 0 deletions cypher/GitLog/Import_aggregated_git_log_csv_data.cypher
@@ -0,0 +1,17 @@
// Import aggregated git log CSV data with the following schema: (Git:Log:Author)-[:AUTHORED]->(Git:Log:ChangeSpan)-[:CONTAINS]->(Git:Log:File)

LOAD CSV WITH HEADERS FROM "file:///aggregatedGitLog.csv" AS row
CALL { WITH row
MERGE (git_author:Git:Log:Author {name: row.author, email: row.email})
MERGE (git_change_span:Git:Log:ChangeSpan {
year: toInteger(row.year),
month: toInteger(row.month),
commits: toInteger(row.commits)
})
MERGE (git_file:Git:Log:File {fileName: row.filename})
MERGE (git_author)-[:AUTHORED]->(git_change_span)
MERGE (git_change_span)-[:CONTAINS]->(git_file)
} IN TRANSACTIONS OF 1000 ROWS
RETURN count(DISTINCT row.author) AS numberOfAuthors
,count(DISTINCT row.filename) AS numberOfFiles
,sum(toInteger(row.commits)) AS numberOfCommits
18 changes: 18 additions & 0 deletions cypher/GitLog/Import_git_log_csv_data.cypher
@@ -0,0 +1,18 @@
// Import git log CSV data with the following schema: (Git:Log:Author)-[:AUTHORED]->(Git:Log:Commit)-[:CONTAINS]->(Git:Log:File)

LOAD CSV WITH HEADERS FROM "file:///gitLog.csv" AS row
CALL { WITH row
MERGE (git_author:Git:Log:Author {name: row.author, email: row.email})
MERGE (git_commit:Git:Log:Commit {
hash: row.hash,
message: row.message,
timestamp: datetime(row.timestamp),
timestamp_unix: toInteger(row.timestamp_unix)
})
MERGE (git_file:Git:Log:File {fileName: row.filename})
MERGE (git_author)-[:AUTHORED]->(git_commit)
MERGE (git_commit)-[:CONTAINS]->(git_file)
} IN TRANSACTIONS OF 1000 ROWS
RETURN count(DISTINCT row.author) AS numberOfAuthors
,count(DISTINCT row.filename) AS numberOfFiles
,count(DISTINCT row.hash) AS numberOfCommits
3 changes: 3 additions & 0 deletions cypher/GitLog/Index_author_name.cypher
@@ -0,0 +1,3 @@
// Create index for author name (git data)

CREATE INDEX INDEX_AUTHOR_NAME IF NOT EXISTS FOR (n:Author) ON (n.name)
3 changes: 3 additions & 0 deletions cypher/GitLog/Index_change_span_year.cypher
@@ -0,0 +1,3 @@
// Create index for change span year (aggregated git data)

CREATE INDEX INDEX_CHANGE_SPAN_YEAR IF NOT EXISTS FOR (n:ChangeSpan) ON (n.year)
3 changes: 3 additions & 0 deletions cypher/GitLog/Index_commit_hash.cypher
@@ -0,0 +1,3 @@
// Create index for commit hash (git data)

CREATE INDEX INDEX_COMMIT_HASH IF NOT EXISTS FOR (n:Commit) ON (n.hash)
3 changes: 3 additions & 0 deletions cypher/GitLog/Index_file_name.cypher
@@ -0,0 +1,3 @@
// Create index for the file name

CREATE INDEX INDEX_FILE_NAME IF NOT EXISTS FOR (t:File) ON (t.fileName)
20 changes: 20 additions & 0 deletions cypher/GitLog/List_ambiguous_git_files.cypher
@@ -0,0 +1,20 @@
// List ambiguously resolved git files where more than one git file is attached to a single code file, for troubleshooting/testing.

MATCH (file:File&!Git)<-[:RESOLVES_TO]-(git_file:File&Git)
OPTIONAL MATCH (artifact:Artifact:Archive)-[:CONTAINS]->(file)
WITH file.fileName AS fileName
,reverse(split(reverse(file.fileName),'.')[0]) AS fileExtension
,count(DISTINCT git_file.fileName) AS gitFilesCount
,collect(DISTINCT split(git_file.fileName,'/')[0])[0..6] AS gitFileFirstPathExamples
,collect(DISTINCT git_file.fileName)[0..6] AS gitFileExamples
,collect(DISTINCT artifact.fileName) AS artifacts
WHERE gitFilesCount > 1
RETURN fileName
,fileExtension
,gitFilesCount
,count(*) AS numberOfCases
,artifacts
,gitFileFirstPathExamples
,gitFileExamples
ORDER BY gitFilesCount DESC, fileName ASC
LIMIT 50
9 changes: 9 additions & 0 deletions cypher/GitLog/List_unresolved_git_files.cypher
@@ -0,0 +1,9 @@
// List code files not covered by imported git data for troubleshooting/testing.

MATCH (code_file:File&!Git&!Directory)
WHERE NOT EXISTS { (code_file)<-[:RESOLVES_TO]-(git_file:File&Git) }
RETURN reverse(split(reverse(code_file.fileName),'.')[0]) AS codeFileExtension
,labels(code_file)[0..3] AS firstThreeCodeFileLabels
,count(DISTINCT code_file.fileName) AS codeFileCount
,collect(DISTINCT code_file.fileName)[0..6] AS codeFileExamples
LIMIT 50
7 changes: 7 additions & 0 deletions cypher/GitLog/Set_number_of_aggregated_git_commits.cypher
@@ -0,0 +1,7 @@
// Set numberOfGitCommits property on code File nodes when aggregated change spans with grouped commits are present.

MATCH (code_file:File&!Git)<-[:RESOLVES_TO]-(git_file:File&Git)
MATCH (git_file)<-[:CONTAINS]-(git_changespan:Git:ChangeSpan)
WITH code_file, sum(git_changespan.commits) AS numberOfGitCommits
SET code_file.numberOfGitCommits = numberOfGitCommits
RETURN count(DISTINCT coalesce(code_file.absoluteFileName, code_file.fileName)) AS changedCodeFiles
7 changes: 7 additions & 0 deletions cypher/GitLog/Set_number_of_git_commits.cypher
@@ -0,0 +1,7 @@
// Set numberOfGitCommits property on code File nodes when git commits are present

MATCH (code_file:File&!Git)<-[:RESOLVES_TO]-(git_file:File&Git)
MATCH (git_file)<-[:CONTAINS]-(git_commit:Git:Commit)
WITH code_file, count(DISTINCT git_commit.hash) AS numberOfGitCommits
SET code_file.numberOfGitCommits = numberOfGitCommits
RETURN count(DISTINCT coalesce(code_file.absoluteFileName, code_file.fileName)) AS changedCodeFiles