|
1 | 1 | <div align="center">
|
2 | 2 |
|
3 | 3 | <p align="center">
|
4 |
| - <img src="https://avatars.githubusercontent.com/u/115590550?s=200&v=4" width="220px" alt="logo"> |
| 4 | + <img src="./asset/img/codetext_logo.png" width="220px" alt="logo"> |
5 | 5 | </p>
|
6 |
| - |
7 |
| -**CodeText-parser** |
8 | 6 | ______________________________________________________________________
|
9 | 7 |
|
10 | 8 |
|
11 | 9 | <!-- Badge start -->
|
12 |
| -| Branch | Build | Unittest | Linting | Release | License | |
13 |
| -|-------- |------- |---------- |--------- |--------- |--------- | |
14 |
| -| main | | [](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml) | | [](https://pypi.org/project/codetext/) [](https://pypi.org/project/codetext/)| [](https://github.com/AI4Code-Research/CodeText-parser/blob/main/LICENSES.txt) | |
| 10 | +| Branch | Build | Unittest | Release | License | |
| 11 | +|-------- |------- |---------- |--------- |--------- | |
| 12 | +| main | | [](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml) | [](https://pypi.org/project/codetext/) [](https://pypi.org/project/codetext/)| [](https://github.com/AI4Code-Research/CodeText-parser/blob/main/LICENSES.txt) | |
15 | 13 | <!-- Badge end -->
|
16 | 14 | </div>
|
17 | 15 |
|
18 | 16 | ______________________________________________________________________
|
19 | 17 |
|
20 |
| -**Code-Text data toolkit** contains multilingual programming language parsers for the extract from raw source code into multiple levels of pair data (code-text) (e.g., function-level, class-level, inline-level). |
| 18 | +**Code-Text parser** is a custom [tree-sitter](https://github.com/tree-sitter) grammar parser that extracts classes and functions from raw source code. We support 10 common programming languages: |
| 19 | +- Python |
| 20 | +- Java |
| 21 | +- JavaScript |
| 22 | +- PHP |
| 23 | +- Ruby |
| 24 | +- Rust |
| 25 | +- C |
| 26 | +- C++ |
| 27 | +- C# |
| 28 | +- Go |
21 | 29 |
|
22 | 30 | # Installation
|
23 |
| -Setup environment and install dependencies and setup by using `install_env.sh` |
24 |
| -```bash |
25 |
| -bash -i ./install_env.sh |
26 |
| -``` |
27 |
| -then activate conda environment named "code-text-env" |
| 31 | +The **codetext** package requires Python 3.7 or above and tree-sitter. To set up the environment and install the dependencies manually from source: |
28 | 32 | ```bash
|
29 |
| -conda activate code-text-env |
| 33 | +git clone https://github.com/FSoft-AI4Code/CodeText-parser.git; cd CodeText-parser |
| 34 | +pip install -r requirement.txt |
| 35 | +pip install -e . |
30 | 36 | ```
|
31 | 37 |
|
32 |
| -*Setup for using parser* |
| 38 | +Or install the package from PyPI: |
33 | 39 | ```bash
|
34 | 40 | pip install codetext
|
35 | 41 | ```
|
36 | 42 |
|
37 | 43 | # Getting started
|
38 | 44 |
|
39 |
| -## Build your language |
40 |
| -Auto build tree-sitter into `<language>.so` located in `/tree-sitter/` |
| 45 | +## `codetext` CLI Usage |
| 46 | +```bash |
| 47 | +codetext [options] [PATH or FILE] ... |
| 48 | +``` |
| 49 | + |
| 50 | +For example, to extract every Python file in the `src/` folder: |
| 51 | +```bash |
| 52 | +codetext src/ --language Python |
| 53 | +``` |
| 54 | + |
| 55 | +To store the extracted classes and functions, pass the `--json` flag and give the destination path with `--output_file`: |
| 56 | +```bash |
| 57 | +codetext src/ --language Python --output_file ./python_report.json --json |
| 58 | +``` |
| 59 | + |
| 60 | +**Options** |
| 61 | + |
| 62 | +```bash |
| 63 | +positional arguments: |
| 64 | + paths list of the filename/paths. |
| 65 | + |
| 66 | +optional arguments: |
| 67 | + -h, --help show this help message and exit |
| 68 | + --version show program's version number and exit |
| 69 | + -l LANGUAGE, --language LANGUAGE |
| 70 | + Target the programming languages you want to analyze. |
| 71 | + -o OUTPUT_FILE, --output_file OUTPUT_FILE |
| 72 | + Output file (e.g report.json). |
| 73 | + --json Generate json output as a transform of the default |
| 74 | + output |
| 75 | + --verbose Print progress bar |
| 76 | +
|
| 77 | +``` |
| 78 | +
|
| 79 | +**Example** |
| 80 | +``` |
| 81 | +File circle_linkedlist.py analyzed: |
| 82 | +================================================== |
| 83 | +Number of class : 1 |
| 84 | +Number of function : 2 |
| 85 | +-------------------------------------------------- |
| 86 | +
|
| 87 | +Class summary: |
| 88 | ++-----+---------+-------------+ |
| 89 | +| # | Class | Arguments | |
| 90 | ++=====+=========+=============+ |
| 91 | +| 0 | Node | | |
| 92 | ++-----+---------+-------------+ |
| 93 | +
|
| 94 | +Class analyse: Node |
| 95 | ++-----+---------------+-------------+--------+---------------+ |
| 96 | +| # | Method name | Paramters | Type | Return type | |
| 97 | ++=====+===============+=============+========+===============+ |
| 98 | +| 0 | __init__ | self | | | |
| 99 | +| | | data | | | |
| 100 | ++-----+---------------+-------------+--------+---------------+ |
| 101 | +
|
| 102 | +Function analyse: |
| 103 | ++-----+-----------------+-------------+--------+---------------+ |
| 104 | +| # | Function name | Paramters | Type | Return type | |
| 105 | ++=====+=================+=============+========+===============+ |
| 106 | +| 0 | push | head_ref | | Node | |
| 107 | +| | | data | Any | Node | |
| 108 | +| 1 | countNodes | head | Node | | |
| 109 | ++-----+-----------------+-------------+--------+---------------+ |
| 110 | +``` |
| 111 | +
|
| 112 | +## Using `codetext` as Python module |
| 113 | +### Build your language |
| 114 | +`codetext` needs a tree-sitter language file (i.e. a `.so` file) to work properly. You can compile the language manually ([see more](https://github.com/tree-sitter/py-tree-sitter#usage)) or build it automatically with our predefined function (the `<language>.so` file will be saved in a folder named `/tree-sitter/`): |
41 | 115 | ```python
|
42 | 116 | from codetext.utils import build_language
|
43 | 117 |
|
44 | 118 | language = 'rust'
|
45 | 119 | build_language(language)
|
46 | 120 |
|
47 |
| - |
48 | 121 | # INFO:utils:Not found tree-sitter-rust, attempt clone from github
|
49 | 122 | # Cloning into 'tree-sitter-rust'...
|
50 | 123 | # remote: Enumerating objects: 2835, done. ...
|
51 | 124 | # INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so
|
52 | 125 | ```
|
53 | 126 |
|
54 |
| -## Language Parser |
55 |
| -We supported 10 programming languages, namely `Python`, `Java`, `JavaScript`, `Golang`, `Ruby`, `PHP`, `C#`, `C++`, `C` and `Rust`. |
| 127 | +### Using Language Parser |
| 128 | +Each supported programming language corresponds to a custom `language_parser` (e.g. Python uses [`PythonParser()`](src/codetext/parser/python_parser.py#L11)). A `language_parser` takes raw source code as input and uses breadth-first search to traverse all syntax nodes. The classes, methods, and stand-alone functions are then collected: |
56 | 129 |
|
57 |
| -Setup |
58 | 130 | ```python
|
59 | 131 | from codetext.utils import parse_code
|
60 | 132 |
|
61 | 133 | raw_code = """
|
62 |
| -/** |
63 |
| -* Sum of 2 number |
64 |
| -* @param a int number |
65 |
| -* @param b int number |
66 |
| -*/ |
67 |
| -double sum2num(int a, int b) { |
68 |
| - return a + b; |
69 |
| -} |
| 134 | + /** |
| 135 | + * Sum of 2 number |
| 136 | + * @param a int number |
| 137 | + * @param b int number |
| 138 | + */ |
| 139 | + double sum2num(int a, int b) { |
| 140 | + return a + b; |
| 141 | + } |
70 | 142 | """
|
71 | 143 |
|
| 144 | +# Automatically parse the code into a tree_sitter.Tree |
72 | 145 | root = parse_code(raw_code, 'cpp')
|
73 | 146 | root_node = root.root_node
|
74 | 147 | ```
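
The breadth-first traversal mentioned above can be illustrated with a minimal sketch. This is not the library's implementation: `FakeNode` and `collect_nodes_bfs` are hypothetical names introduced here, and `FakeNode` only mimics the `type` and `children` attributes of a `tree_sitter.Node` so the example runs without a compiled grammar. The node type strings are illustrative.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeNode:
    """Hypothetical stand-in for tree_sitter.Node (only `type` and `children`)."""
    type: str
    children: List["FakeNode"] = field(default_factory=list)


def collect_nodes_bfs(root, wanted_types):
    """Return every node whose type is in `wanted_types`, in breadth-first order."""
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()          # visit the oldest queued node first
        if node.type in wanted_types:
            found.append(node)
        queue.extend(node.children)     # children are visited after the current level
    return found


# Toy syntax tree: one class containing a method, plus a stand-alone function.
tree = FakeNode("translation_unit", [
    FakeNode("comment"),
    FakeNode("class_specifier", [
        FakeNode("field_declaration_list", [FakeNode("function_definition")]),
    ]),
    FakeNode("function_definition"),
])

matches = collect_nodes_bfs(tree, {"function_definition", "class_specifier"})
print([node.type for node in matches])
```

Because the queue is processed level by level, the top-level class and function are reported before the method nested inside the class. The same loop should work on a real `root_node`, assuming the node objects expose `type` and `children`.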
|
75 | 148 |
|
76 |
| -Get all function nodes inside a specific node, use: |
| 149 | +Get all function nodes inside a specific node: |
77 | 150 | ```python
|
78 | 151 | from codetext.utils.parser import CppParser
|
79 | 152 |
|
@@ -105,3 +178,9 @@ class_list = CppParser.get_class_list(root_node)
|
105 | 178 | # and
|
106 | 179 | metadata = CppParser.get_metadata_list(root_node)
|
107 | 180 | ```
|
| 181 | + |
| 182 | +# Limitations |
| 183 | +`codetext` heavily depends on tree-sitter syntax: |
| 184 | +- Since we use the tree-sitter grammar to extract the desired nodes, such as functions, classes, function names (identifiers), and class argument lists, `codetext` can easily break after a tree-sitter update or a future syntax change. |
| 185 | + |
| 186 | +- While we try our best to capture every possibility, many cases remain uncovered. We welcome community contributions to this project. |