Dev/extended #3

Merged
merged 11 commits into from
Jul 12, 2023
10 changes: 9 additions & 1 deletion HISTORY.md
@@ -70,10 +70,18 @@ Release data: Dec 12, 2022

Version 0.0.6
=============
Release data: Jan 9, 2022
Release data: Jan 9, 2023

* Add tree sitter utils (in codetext.parser)
* Replace all `match_from_span` to `get_node_text`
* Replace all `traverse_type` to `get_node_by_kind`
* Fix `CppParser.get_function_metadata` missing `param_type` and `param_identifier`
* Update return metadata from all parser

Version 0.0.7
=============
Release data: Jul 5, 2023

* Update all class extractor format (using dict instead of list)
* Fix missing identifier, parameter in C, C#, Java parser
* Implement CLI
137 changes: 108 additions & 29 deletions README.md
@@ -1,79 +1,152 @@
<div align="center">

<p align="center">
<img src="https://avatars.githubusercontent.com/u/115590550?s=200&v=4" width="220px" alt="logo">
<img src="./asset/img/codetext_logo.png" width="220px" alt="logo">
</p>

**CodeText-parser**
______________________________________________________________________


<!-- Badge start -->
| Branch | Build | Unittest | Linting | Release | License |
|-------- |------- |---------- |--------- |--------- |--------- |
| main | | [![Unittest](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml/badge.svg)](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml) | | [![release](https://img.shields.io/pypi/v/codetext)](https://pypi.org/project/codetext/) [![pyversion](https://img.shields.io/pypi/pyversions/codetext)](https://pypi.org/project/codetext/)| [![license](https://img.shields.io/github/license/AI4Code-Research/CodeText-parser)](https://github.com/AI4Code-Research/CodeText-parser/blob/main/LICENSES.txt) |
| Branch | Build | Unittest | Release | License |
|-------- |------- |---------- |--------- |--------- |
| main | | [![Unittest](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml/badge.svg)](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml) | [![release](https://img.shields.io/pypi/v/codetext)](https://pypi.org/project/codetext/) [![pyversion](https://img.shields.io/pypi/pyversions/codetext)](https://pypi.org/project/codetext/)| [![license](https://img.shields.io/github/license/AI4Code-Research/CodeText-parser)](https://github.com/AI4Code-Research/CodeText-parser/blob/main/LICENSES.txt) |
<!-- Badge end -->
</div>

______________________________________________________________________

**Code-Text data toolkit** contains multilingual programming language parsers for the extract from raw source code into multiple levels of pair data (code-text) (e.g., function-level, class-level, inline-level).
**Code-Text parser** is a custom parser built on [tree-sitter](https://github.com/tree-sitter) grammars that extracts class-level and function-level data from raw source code. We support 10 common programming languages:
- Python
- Java
- JavaScript
- PHP
- Ruby
- Rust
- C
- C++
- C#
- Go

# Installation
Setup environment and install dependencies and setup by using `install_env.sh`
```bash
bash -i ./install_env.sh
```
then activate conda environment named "code-text-env"
The **codetext** package requires Python 3.7 or above and tree-sitter. Set up the environment and install the dependencies manually from source:
```bash
conda activate code-text-env
git clone https://github.com/FSoft-AI4Code/CodeText-parser.git; cd CodeText-parser
pip install -r requirements.txt
pip install -e .
```

*Setup for using parser*
Or install the package from PyPI:
```bash
pip install codetext
```

# Getting started

## Build your language
Auto build tree-sitter into `<language>.so` located in `/tree-sitter/`
## `codetext` CLI Usage
```bash
codetext [options] [PATH or FILE] ...
```

For example, to analyze every Python file in the `src/` folder:
```bash
codetext src/ --language Python
```

To store the extracted classes and functions, pass the `--json` flag together with `--output_file` and a destination path:
```bash
codetext src/ --language Python --output_file ./python_report.json --json
```
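
Per `src/codetext/__main__.py` in this PR, the report is a JSON object keyed by the path of each analyzed file. The shape of each value (whatever `parse_file` returns) is not shown in this diff, so the sketch below only reads the top-level keys. `list_analyzed_files` is a hypothetical helper for illustration, not part of `codetext`.

```python
import json
from typing import Dict, List


def list_analyzed_files(report_path: str) -> List[str]:
    """Return the file paths stored as top-level keys of a codetext report.

    Assumption: the report is a JSON object mapping file path -> metadata,
    as written by json.dump(output_metadata, ...) in __main__.py.
    """
    with open(report_path, "r", encoding="utf-8") as handle:
        report: Dict[str, object] = json.load(handle)
    return sorted(report)
```

Calling `list_analyzed_files("./python_report.json")` after the command above should list the files that were parsed, which is a quick way to confirm that the extension filter picked up what you expected.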

**Options**

```bash
positional arguments:
paths list of the filename/paths.

optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-l LANGUAGE, --language LANGUAGE
Target the programming languages you want to analyze.
-o OUTPUT_FILE, --output_file OUTPUT_FILE
Output file (e.g report.json).
--json Generate json output as a transform of the default
output
--verbose Print progress bar

```
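
In this PR, `main()` enforces that `--json` is accompanied by `--output_file` by raising a `ValueError`. As a possible alternative, the same rule could be reported through `argparse` itself, which prints the usage line and exits with status 2. The sketch below is a simplified stand-in under that assumption: `build_parser` and `parse_cli` are hypothetical names, and `--version` is omitted to keep the example self-contained.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented options (hypothetical, simplified)."""
    parser = argparse.ArgumentParser(prog="codetext")
    parser.add_argument("paths", nargs="*", default=["."],
                        help="list of the filename/paths.")
    parser.add_argument("-l", "--language",
                        help="Target the programming language to analyze.")
    parser.add_argument("-o", "--output_file",
                        help="Output file (e.g. report.json).")
    parser.add_argument("--json", action="store_true",
                        help="Write the result as JSON to --output_file.")
    parser.add_argument("--verbose", action="store_true",
                        help="Print progress bar.")
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments and reject --json without --output_file."""
    parser = build_parser()
    opt = parser.parse_args(argv)
    if opt.json and not opt.output_file:
        # parser.error prints usage to stderr and raises SystemExit(2).
        parser.error("--json requires --output_file")
    return opt
```

For instance, `parse_cli(["src/", "--language", "Python"])` yields a namespace with `paths == ["src/"]`, while `parse_cli(["--json"])` exits with a usage message instead of a traceback.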

**Example**
```
File circle_linkedlist.py analyzed:
==================================================
Number of class : 1
Number of function : 2
--------------------------------------------------

Class summary:
+-----+---------+-------------+
| # | Class | Arguments |
+=====+=========+=============+
| 0 | Node | |
+-----+---------+-------------+

Class analyse: Node
+-----+---------------+-------------+--------+---------------+
| # | Method name | Paramters | Type | Return type |
+=====+===============+=============+========+===============+
| 0 | __init__ | self | | |
| | | data | | |
+-----+---------------+-------------+--------+---------------+

Function analyse:
+-----+-----------------+-------------+--------+---------------+
| # | Function name | Paramters | Type | Return type |
+=====+=================+=============+========+===============+
| 0 | push | head_ref | | Node |
| | | data | Any | Node |
| 1 | countNodes | head | Node | |
+-----+-----------------+-------------+--------+---------------+
```

## Using `codetext` as Python module
### Build your language
`codetext` needs a compiled tree-sitter language file (i.e. a `.so` file) to work properly. You can compile the language manually ([see more](https://github.com/tree-sitter/py-tree-sitter#usage)) or build it automatically with our pre-defined function (the resulting `<language>.so` is saved in a folder named `/tree-sitter/`):
```python
from codetext.utils import build_language

language = 'rust'
build_language(language)


# INFO:utils:Not found tree-sitter-rust, attempt clone from github
# Cloning into 'tree-sitter-rust'...
# remote: Enumerating objects: 2835, done. ...
# INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so
```

## Language Parser
We supported 10 programming languages, namely `Python`, `Java`, `JavaScript`, `Golang`, `Ruby`, `PHP`, `C#`, `C++`, `C` and `Rust`.
### Using Language Parser
Each supported programming language corresponds to a custom `language_parser` (e.g. Python uses [`PythonParser()`](src/codetext/parser/python_parser.py#L11)). A `language_parser` takes raw source code as input and uses breadth-first search to traverse all syntax nodes, collecting every class, method, and stand-alone function:

Setup
```python
from codetext.utils import parse_code

raw_code = """
/**
* Sum of 2 number
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
return a + b;
}
/**
* Sum of 2 number
* @param a int number
* @param b int number
*/
double sum2num(int a, int b) {
return a + b;
}
"""

# Auto parse code into tree-sitter.Tree
root = parse_code(raw_code, 'cpp')
root_node = root.root_node
```

Get all function nodes inside a specific node, use:
Get all function nodes inside a specific node:
```python
from codetext.utils.parser import CppParser

@@ -105,3 +178,9 @@ class_list = CppParser.get_class_list(root_node)
# and
metadata = CppParser.get_metadata_list(root_node)
```

# Limitations
`codetext` depends heavily on tree-sitter syntax:
- We rely on the tree-sitter grammar to extract the desired nodes, such as functions, classes, function names (identifiers), and class argument lists, so `codetext` can easily break when a tree-sitter update or a future syntax change alters that grammar.

- While we try our best to capture every possibility, many cases remain uncovered. We welcome community contributions to this project.
Binary file added asset/img/codetext_logo.png
Binary file added asset/img/codetext_logo_line.png
6 changes: 5 additions & 1 deletion pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "codetext"
version = "0.0.5"
version = "0.0.7"
authors = [
{ name="Dung Manh Nguyen", email="[email protected]" },
]
@@ -21,8 +21,12 @@ dependencies = [
"Levenshtein>=0.20",
"langdetect>=1.0.0",
"bs4>=0.0.1",
"tabulate>=0.9.0"
]

[project.urls]
"Homepage" = "https://github.com/AI4Code-Research/CodeText-data"
"Bug Tracker" = "https://github.com/AI4Code-Research/CodeText-data/issues"

[project.scripts]
codetext = "codetext.__main__:main"
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,6 +1,6 @@
# for preprocessing
tree-sitter
# docstring-parser
tabulate
Levenshtein
langdetect
bs4
Empty file modified src/codetext/__init__.py
100755 → 100644
Empty file.
93 changes: 93 additions & 0 deletions src/codetext/__main__.py
@@ -0,0 +1,93 @@
import os
import sys
import argparse
import pkg_resources

import json
from .codetext_cli import parse_file, print_result, PL_MATCHING


def get_args():
    parser = argparse.ArgumentParser(description=f"codetext parser {20*'='}")

    parser.add_argument('paths', nargs='*', default=['.'],
                        help='list of the filename/paths.')
    parser.add_argument("--version", action="version",
                        version=pkg_resources.get_distribution("codetext").version)
    parser.add_argument("-l", "--language",
                        help='''Target the programming languages you want to
                        analyze.''')
    parser.add_argument("-o", "--output_file",
                        help='''Output file (e.g report.json).
                        ''',
                        type=str)
    parser.add_argument("--json",
                        help='''Generate json output as a transform of the
                        default output''',
                        action="store_true")
    parser.add_argument("--verbose",
                        help='''Print progress bar''',
                        action="store_true")

    return parser.parse_args()


def main():
    opt = get_args()

    # check args
    if opt.json:
        if not opt.output_file:
            raise ValueError("Missing --output_file")
    if opt.language:
        if opt.language not in PL_MATCHING.keys():
            raise ValueError(
                "{language} not supported. Currently support {sp_language}"
                .format(language=opt.language,
                        sp_language=list(PL_MATCHING.keys())))

    # check path
    for path in opt.paths:
        assert os.path.exists(path) == True, "paths is not valid"

        if os.path.isdir(path):
            files = [os.path.join(path, f) for f in os.listdir(path) \
                     if os.path.isfile(os.path.join(path, f))]
        elif os.path.isfile(path):
            files = [path]

    if opt.language:
        for file in files[:]:
            filename, file_extension = os.path.splitext(file)
            if file_extension not in PL_MATCHING[opt.language]:
                files.remove(file)

    output_metadata = {}
    for file in files:
        filename, file_extension = os.path.splitext(file)

        if opt.language == None:
            for lang, ext_list in PL_MATCHING.items():
                if file_extension in ext_list:
                    language = lang
                    break
        else:
            language = opt.language

        output = parse_file(file, language=language)
        print_result(
            output,
            file_name=str(filename).split(os.sep)[-1]+file_extension
        )
        output_metadata[file] = output

    if opt.json:
        save_path = opt.output_file
        with open(save_path, 'w') as output_file:
            json.dump(output_metadata, output_file, sort_keys=True, indent=4)
        print(50*'=')
        print("Save report to {path}".format(path=save_path))


if __name__ == '__main__':
    main()
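
Review note: when `--language` is not given, `main()` assigns `language` only if the file extension matches an entry in `PL_MATCHING`. A file with an unrecognised extension would therefore either reuse the language of the previous file or fail because `language` was never assigned. A minimal sketch of a lookup that makes the miss explicit follows. `EXTENSIONS`, `detect_language`, and `group_by_language` are hypothetical names, and the mapping contents are assumed, since the real `PL_MATCHING` is not shown in this diff.

```python
import os
from typing import Dict, List, Optional

# Hypothetical stand-in for PL_MATCHING; the real mapping lives in
# codetext.codetext_cli and its contents are an assumption here.
EXTENSIONS: Dict[str, List[str]] = {
    "Python": [".py"],
    "Java": [".java"],
    "C++": [".cpp", ".hpp", ".cc"],
}


def detect_language(path: str,
                    table: Dict[str, List[str]] = EXTENSIONS) -> Optional[str]:
    """Return the language whose extension list contains the file's suffix."""
    _, ext = os.path.splitext(path)
    for lang, ext_list in table.items():
        if ext in ext_list:
            return lang
    return None  # unknown extension: the caller decides whether to skip


def group_by_language(paths: List[str]) -> Dict[str, List[str]]:
    """Bucket files by detected language, dropping unrecognised ones."""
    grouped: Dict[str, List[str]] = {}
    for path in paths:
        lang = detect_language(path)
        if lang is None:
            continue  # e.g. README.md or data files in the same folder
        grouped.setdefault(lang, []).append(path)
    return grouped
```

Returning `None` instead of leaving a variable unassigned lets the caller skip or report unsupported files deliberately, which matters when a folder mixes source code with other file types.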
Empty file modified src/codetext/clean/__init__.py
100755 → 100644
Empty file.
Empty file modified src/codetext/clean/noise_removal.py
100755 → 100644
Empty file.