Commit 026a807

Merge pull request #3 from FSoft-AI4Code/dev/extended
Dev/extended
2 parents b6a5973 + 99e6575 commit 026a807

34 files changed: +604 -84 lines changed

HISTORY.md

Lines changed: 9 additions & 1 deletion
@@ -70,10 +70,18 @@ Release data: Dec 12, 2022

 Version 0.0.6
 =============
-Release data: Jan 9, 2022
+Release data: Jan 9, 2023

 * Add tree sitter utils (in codetext.parser)
 * Replace all `match_from_span` to `get_node_text`
 * Replace all `traverse_type` to `get_node_by_kind`
 * Fix `CppParser.get_function_metadata` missing `param_type` and `param_identifier`
 * Update return metadata from all parser
+
+Version 0.0.7
+=============
+Release data: Jul 5, 2023
+
+* Update all class extractor format (using dict instead of list)
+* Fix missing identifier, parameter in C, C#, Java parser
+* Implement CLI

README.md

Lines changed: 108 additions & 29 deletions
@@ -1,79 +1,152 @@
 <div align="center">

 <p align="center">
-<img src="https://avatars.githubusercontent.com/u/115590550?s=200&v=4" width="220px" alt="logo">
+<img src="./asset/img/codetext_logo.png" width="220px" alt="logo">
 </p>
-
-**CodeText-parser**
 ______________________________________________________________________


 <!-- Badge start -->
-| Branch | Build | Unittest | Linting | Release | License |
-|-------- |------- |---------- |--------- |--------- |--------- |
-| main | | [![Unittest](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml/badge.svg)](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml) | | [![release](https://img.shields.io/pypi/v/codetext)](https://pypi.org/project/codetext/) [![pyversion](https://img.shields.io/pypi/pyversions/codetext)](https://pypi.org/project/codetext/)| [![license](https://img.shields.io/github/license/AI4Code-Research/CodeText-parser)](https://github.com/AI4Code-Research/CodeText-parser/blob/main/LICENSES.txt) |
+| Branch | Build | Unittest | Release | License |
+|-------- |------- |---------- |--------- |--------- |
+| main | | [![Unittest](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml/badge.svg)](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml) | [![release](https://img.shields.io/pypi/v/codetext)](https://pypi.org/project/codetext/) [![pyversion](https://img.shields.io/pypi/pyversions/codetext)](https://pypi.org/project/codetext/)| [![license](https://img.shields.io/github/license/AI4Code-Research/CodeText-parser)](https://github.com/AI4Code-Research/CodeText-parser/blob/main/LICENSES.txt) |
 <!-- Badge end -->
 </div>

 ______________________________________________________________________

-**Code-Text data toolkit** contains multilingual programming language parsers for the extract from raw source code into multiple levels of pair data (code-text) (e.g., function-level, class-level, inline-level).
+**Code-Text parser** is a custom parser built on [tree-sitter](https://github.com/tree-sitter) grammars that extracts classes and functions from raw source code. It supports 10 common programming languages:
+- Python
+- Java
+- JavaScript
+- PHP
+- Ruby
+- Rust
+- C
+- C++
+- C#
+- Go

 # Installation
-Setup environment and install dependencies and setup by using `install_env.sh`
-```bash
-bash -i ./install_env.sh
-```
-then activate conda environment named "code-text-env"
+The **codetext** package requires Python 3.7 or above and tree-sitter. To set up the environment and install the dependencies manually from source:
 ```bash
-conda activate code-text-env
+git clone https://github.com/FSoft-AI4Code/CodeText-parser.git; cd CodeText-parser
+pip install -r requirements.txt
+pip install -e .
 ```

-*Setup for using parser*
+Or install the package from PyPI:
 ```bash
 pip install codetext
 ```

 # Getting started

-## Build your language
-Auto build tree-sitter into `<language>.so` located in `/tree-sitter/`
+## `codetext` CLI Usage
+```bash
+codetext [options] [PATH or FILE] ...
+```
+
+For example, to extract every Python file in the `src/` folder:
+```bash
+codetext src/ --language Python
+```
+
+To store the extracted classes and functions, pass the `--json` flag together with `--output_file` and a path to the destination file:
+```bash
+codetext src/ --language Python --output_file ./python_report.json --json
+```
+
+**Options**
+
+```bash
+positional arguments:
+  paths                 list of the filename/paths.
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --version             show program's version number and exit
+  -l LANGUAGE, --language LANGUAGE
+                        Target the programming languages you want to analyze.
+  -o OUTPUT_FILE, --output_file OUTPUT_FILE
+                        Output file (e.g report.json).
+  --json                Generate json output as a transform of the default
+                        output
+  --verbose             Print progress bar
+
+```
+
+**Example**
+```
+File circle_linkedlist.py analyzed:
+==================================================
+Number of class : 1
+Number of function : 2
+--------------------------------------------------
+
+Class summary:
++-----+---------+-------------+
+| # | Class | Arguments |
++=====+=========+=============+
+| 0 | Node | |
++-----+---------+-------------+
+
+Class analyse: Node
++-----+---------------+-------------+--------+---------------+
+| # | Method name | Paramters | Type | Return type |
++=====+===============+=============+========+===============+
+| 0 | __init__ | self | | |
+| | | data | | |
++-----+---------------+-------------+--------+---------------+
+
+Function analyse:
++-----+-----------------+-------------+--------+---------------+
+| # | Function name | Paramters | Type | Return type |
++=====+=================+=============+========+===============+
+| 0 | push | head_ref | | Node |
+| | | data | Any | Node |
+| 1 | countNodes | head | Node | |
++-----+-----------------+-------------+--------+---------------+
+```
+
+## Using `codetext` as Python module
+### Build your language
+`codetext` needs a compiled tree-sitter language file (i.e. a `.so` file) to work properly. You can compile the language manually ([see more](https://github.com/tree-sitter/py-tree-sitter#usage)) or build it automatically with our pre-defined function (the `<language>.so` file will be saved in a folder named `/tree-sitter/`):
 ```python
 from codetext.utils import build_language

 language = 'rust'
 build_language(language)

-
 # INFO:utils:Not found tree-sitter-rust, attempt clone from github
 # Cloning into 'tree-sitter-rust'...
 # remote: Enumerating objects: 2835, done. ...
 # INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so
 ```

-## Language Parser
-We supported 10 programming languages, namely `Python`, `Java`, `JavaScript`, `Golang`, `Ruby`, `PHP`, `C#`, `C++`, `C` and `Rust`.
+### Using Language Parser
+Each supported programming language corresponds to a custom `language_parser` (e.g. Python uses [`PythonParser()`](src/codetext/parser/python_parser.py#L11)). A `language_parser` takes raw source code as input and uses breadth-first search to traverse all syntax nodes, collecting every class, method, and stand-alone function:

-Setup
 ```python
 from codetext.utils import parse_code

 raw_code = """
-/**
-* Sum of 2 number
-* @param a int number
-* @param b int number
-*/
-double sum2num(int a, int b) {
-return a + b;
-}
+/**
+* Sum of 2 number
+* @param a int number
+* @param b int number
+*/
+double sum2num(int a, int b) {
+return a + b;
+}
 """

+# Auto parse code into tree-sitter.Tree
 root = parse_code(raw_code, 'cpp')
 root_node = root.root_node
 ```

-Get all function nodes inside a specific node, use:
+Get all function nodes inside a specific node:
 ```python
 from codetext.utils.parser import CppParser

@@ -105,3 +178,9 @@ class_list = CppParser.get_class_list(root_node)
 # and
 metadata = CppParser.get_metadata_list(root_node)
 ```
+
+# Limitations
+`codetext` heavily depends on tree-sitter syntax:
+- Because we use the tree-sitter grammar to extract the desired nodes (functions, classes, function names (identifiers), class argument lists, etc.), `codetext` can easily break when tree-sitter ships an update or changes its syntax.
+
+- While we try our best to capture every possibility, there are still plenty of cases we do not cover. We welcome community contributions to this project.
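
The README text in this diff says each language parser walks the syntax tree breadth-first and collects class, method, and function nodes. The following is a minimal sketch of that idea, not the library's implementation: `FakeNode` is a hypothetical stand-in that mirrors only the `type` and `children` attributes of a tree-sitter node, and `collect_by_kind` and the `name` field are names invented here for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FakeNode:
    """Hypothetical stand-in for a tree-sitter node (only `type` and `children`)."""
    type: str
    name: str = ""  # illustration only; real nodes expose their text differently
    children: List["FakeNode"] = field(default_factory=list)


def collect_by_kind(root: FakeNode, kinds: Set[str]) -> List[FakeNode]:
    """Breadth-first walk that returns every node whose type is in `kinds`."""
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()        # FIFO order gives breadth-first traversal
        if node.type in kinds:
            found.append(node)
        queue.extend(node.children)   # visit children after the current level
    return found


tree = FakeNode("module", children=[
    FakeNode("comment"),
    FakeNode("class_definition", "Node", [
        FakeNode("function_definition", "__init__"),
    ]),
    FakeNode("function_definition", "push"),
])

names = [n.name for n in collect_by_kind(tree, {"class_definition", "function_definition"})]
print(names)  # ['Node', 'push', '__init__']
```

Because the walk is level by level, the top-level `push` is reported before the nested `__init__`; a depth-first walk would return them in the opposite order.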

asset/img/codetext_logo.png

14.1 KB

asset/img/codetext_logo_line.png

11.2 KB

pyproject.toml

Lines changed: 5 additions & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "codetext"
-version = "0.0.5"
+version = "0.0.7"
 authors = [
 { name="Dung Manh Nguyen", email="[email protected]" },
 ]
@@ -21,8 +21,12 @@ dependencies = [
 "Levenshtein>=0.20",
 "langdetect>=1.0.0",
 "bs4>=0.0.1",
+"tabulate>=0.9.0"
 ]

 [project.urls]
 "Homepage" = "https://github.com/AI4Code-Research/CodeText-data"
 "Bug Tracker" = "https://github.com/AI4Code-Research/CodeText-data/issues"
+
+[project.scripts]
+codetext = "codetext.__main__:main"
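
The new `[project.scripts]` entry uses the `module:object` form, so an installer creates a `codetext` command that imports `codetext.__main__` and calls its `main` function. As a rough sketch of how such a string is resolved (`load_entry_point` is a hypothetical helper written for this note, and `json:dumps` is used only so the example runs without `codetext` installed):

```python
import importlib


def load_entry_point(spec: str):
    """Resolve a 'package.module:object.attr' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    if attr_path:
        for attr in attr_path.split("."):
            obj = getattr(obj, attr)
    return obj


# With codetext installed, "codetext.__main__:main" would resolve the same way.
dumps = load_entry_point("json:dumps")
print(dumps({"ok": True}))  # {"ok": true}
```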

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # for preprocessing
 tree-sitter
-# docstring-parser
+tabulate
 Levenshtein
 langdetect
 bs4

src/codetext/__init__.py

File mode changed: 100755 → 100644

src/codetext/__main__.py

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
+import os
+import sys
+import argparse
+import pkg_resources
+
+import json
+from .codetext_cli import parse_file, print_result, PL_MATCHING
+
+
+def get_args():
+    parser = argparse.ArgumentParser(description=f"codetext parser {20*'='}")
+
+    parser.add_argument('paths', nargs='*', default=['.'],
+                        help='list of the filename/paths.')
+    parser.add_argument("--version", action="version",
+                        version=pkg_resources.get_distribution("codetext").version)
+    parser.add_argument("-l", "--language",
+                        help='''Target the programming languages you want to
+                        analyze.''')
+    parser.add_argument("-o", "--output_file",
+                        help='''Output file (e.g report.json).
+                        ''',
+                        type=str)
+    parser.add_argument("--json",
+                        help='''Generate json output as a transform of the
+                        default output''',
+                        action="store_true")
+    parser.add_argument("--verbose",
+                        help='''Print progress bar''',
+                        action="store_true")
+
+    return parser.parse_args()
+
+
+def main():
+    opt = get_args()
+
+    # check args
+    if opt.json:
+        if not opt.output_file:
+            raise ValueError("Missing --output_file")
+    if opt.language:
+        if opt.language not in PL_MATCHING.keys():
+            raise ValueError(
+                "{language} not supported. Currently support {sp_language}"
+                .format(language=opt.language,
+                        sp_language=list(PL_MATCHING.keys())))
+
+    # check path
+    for path in opt.paths:
+        assert os.path.exists(path) == True, "paths is not valid"
+
+        if os.path.isdir(path):
+            files = [os.path.join(path, f) for f in os.listdir(path) \
+                     if os.path.isfile(os.path.join(path, f))]
+        elif os.path.isfile(path):
+            files = [path]
+
+    if opt.language:
+        for file in files[:]:
+            filename, file_extension = os.path.splitext(file)
+            if file_extension not in PL_MATCHING[opt.language]:
+                files.remove(file)
+
+    output_metadata = {}
+    for file in files:
+        filename, file_extension = os.path.splitext(file)
+
+        if opt.language == None:
+            for lang, ext_list in PL_MATCHING.items():
+                if file_extension in ext_list:
+                    language = lang
+                    break
+        else:
+            language = opt.language
+
+        output = parse_file(file, language=language)
+        print_result(
+            output,
+            file_name=str(filename).split(os.sep)[-1]+file_extension
+        )
+        output_metadata[file] = output
+
+    if opt.json:
+        save_path = opt.output_file
+        with open(save_path, 'w') as output_file:
+            json.dump(output_metadata, output_file, sort_keys=True, indent=4)
+        print(50*'=')
+        print("Save report to {path}".format(path=save_path))
+
+
+if __name__ == '__main__':
+    main()
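
In the auto-detection branch of `main()`, `language` is only assigned when the file extension appears in `PL_MATCHING`, so a file with an unknown extension either raises a `NameError` or reuses the language of the previous file. A minimal sketch of one way to make that lookup explicit follows. It assumes `PL_MATCHING` maps a language name to a list of extensions, as the code above implies; the mapping contents and the `detect_language` helper are hypothetical, since `codetext_cli` is not shown in this diff.

```python
import os
from typing import Dict, List, Optional

# Hypothetical contents for illustration; the real mapping lives in codetext_cli.
PL_MATCHING: Dict[str, List[str]] = {
    "Python": [".py"],
    "Java": [".java"],
    "C++": [".cpp", ".hpp"],
}


def detect_language(path: str, forced: Optional[str] = None,
                    matching: Dict[str, List[str]] = PL_MATCHING) -> Optional[str]:
    """Return the forced language, the one matching the extension, or None."""
    if forced is not None:
        return forced
    _, extension = os.path.splitext(path)
    for lang, ext_list in matching.items():
        if extension in ext_list:
            return lang
    return None  # unknown extension: the caller can skip the file


for candidate in ["src/list.py", "README.md"]:
    lang = detect_language(candidate)
    if lang is None:
        print(f"skip {candidate}: unsupported extension")
    else:
        print(f"parse {candidate} as {lang}")
```

Returning `None` instead of leaving the variable unset lets the caller decide whether to skip the file or report an error.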

src/codetext/clean/__init__.py

File mode changed: 100755 → 100644

src/codetext/clean/noise_removal.py

File mode changed: 100755 → 100644
