FSoft-AI4Code · minhna1112 · Jul 12, 2023 · Jun 12, 2023 · Jul 4, 2023 · Jul 4, 2023
diff --git a/HISTORY.md b/HISTORY.md
@@ -70,10 +70,18 @@ Release data: Dec 12, 2022
 
 Version 0.0.6
 =============
-Release data: Jan 9, 2022
+Release date: Jan 9, 2023
 
 * Add tree sitter utils (in codetext.parser)
 * Replace all `match_from_span` to `get_node_text`
 * Replace all `traverse_type` to `get_node_by_kind`
 * Fix `CppParser.get_function_metadata` missing `param_type` and `param_identifier`
 * Update return metadata from all parser
+
+Version 0.0.7
+=============
+Release date: Jul 5, 2023
+
+* Update all class extractor format (using dict instead of list)
+* Fix missing identifier, parameter in C, C#, Java parser
+* Implement CLI
diff --git a/README.md b/README.md
@@ -1,79 +1,152 @@
 <div align="center">
 
 <p align="center">
-  <img src="https://avatars.githubusercontent.com/u/115590550?s=200&v=4" width="220px" alt="logo">
+  <img src="./asset/img/codetext_logo.png" width="220px" alt="logo">
 </p>
-
-**CodeText-parser**
 ______________________________________________________________________
 
 
 <!-- Badge start -->
-| Branch 	| Build 	| Unittest 	| Linting 	| Release 	| License 	|
-|--------	|-------	|----------	|---------	|---------	|---------	|
-| main   	|       	| [![Unittest](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml/badge.svg)](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml) |       	| [![release](https://img.shields.io/pypi/v/codetext)](https://pypi.org/project/codetext/) [![pyversion](https://img.shields.io/pypi/pyversions/codetext)](https://pypi.org/project/codetext/)| [![license](https://img.shields.io/github/license/AI4Code-Research/CodeText-parser)](https://github.com/AI4Code-Research/CodeText-parser/blob/main/LICENSES.txt) |
+| Branch 	| Build 	| Unittest 	| Release 	| License 	|
+|--------	|-------	|----------	|---------	|---------	|
+| main   	|       	| [![Unittest](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml/badge.svg)](https://github.com/AI4Code-Research/CodeText-parser/actions/workflows/unittest.yml) | [![release](https://img.shields.io/pypi/v/codetext)](https://pypi.org/project/codetext/) [![pyversion](https://img.shields.io/pypi/pyversions/codetext)](https://pypi.org/project/codetext/)| [![license](https://img.shields.io/github/license/AI4Code-Research/CodeText-parser)](https://github.com/AI4Code-Research/CodeText-parser/blob/main/LICENSES.txt) |
 <!-- Badge end -->
 </div>
 
 ______________________________________________________________________
 
-**Code-Text data toolkit** contains multilingual programming language parsers for the extract from raw source code into multiple levels of pair data (code-text) (e.g., function-level, class-level, inline-level). 
+**Code-Text parser** is a custom parser built on [tree-sitter](https://github.com/tree-sitter) grammars that extracts classes and functions from raw source code. We support 10 common programming languages:
+- Python
+- Java
+- JavaScript
+- PHP
+- Ruby
+- Rust
+- C
+- C++
+- C#
+- Go
 
 # Installation
-Setup environment and install dependencies and setup by using `install_env.sh`
-```bash
-bash -i ./install_env.sh
-```
-then activate conda environment named "code-text-env"
+The **codetext** package requires Python 3.7 or above and tree-sitter. To set up the environment and install the dependencies manually from source:
 ```bash
-conda activate code-text-env
+git clone https://github.com/FSoft-AI4Code/CodeText-parser.git; cd CodeText-parser
+pip install -r requirements.txt
+pip install -e .
 ```
 
-*Setup for using parser*
+Or install the package from PyPI:
 ```bash
 pip install codetext
 ```
 
 # Getting started
 
-## Build your language
-Auto build tree-sitter into `<language>.so` located in `/tree-sitter/`
+## `codetext` CLI Usage
+```bash
+codetext [options] [PATH or FILE] ...
+```
+
+For example, to analyze every Python file directly inside the `src/` folder (subfolders are not searched):
+```bash
+codetext src/ --language Python
+```
+
+To save the extracted classes and functions, pass the `--json` flag together with `--output_file` and a destination path:
+```bash
+codetext src/ --language Python --output_file ./python_report.json --json
+```
+
+**Options**
+
+```bash
+positional arguments:
+  paths                 list of the filename/paths.
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --version             show program's version number and exit
+  -l LANGUAGE, --language LANGUAGE
+                        Target the programming languages you want to analyze.
+  -o OUTPUT_FILE, --output_file OUTPUT_FILE
+                        Output file (e.g report.json).
+  --json                Generate json output as a transform of the default
+                        output
+  --verbose             Print progress bar
+
+```
+
+**Example**
+```
+File circle_linkedlist.py analyzed:
+==================================================
+Number of class    : 1
+Number of function : 2
+--------------------------------------------------
+
+Class summary:
++-----+---------+-------------+
+|   # | Class   | Arguments   |
++=====+=========+=============+
+|   0 | Node    |             |
++-----+---------+-------------+
+
+Class analyse: Node
++-----+---------------+-------------+--------+---------------+
+| #   | Method name   | Paramters   | Type   | Return type   |
++=====+===============+=============+========+===============+
+| 0   | __init__      | self        |        |               |
+|     |               | data        |        |               |
++-----+---------------+-------------+--------+---------------+
+
+Function analyse:
++-----+-----------------+-------------+--------+---------------+
+| #   | Function name   | Paramters   | Type   | Return type   |
++=====+=================+=============+========+===============+
+| 0   | push            | head_ref    |        | Node          |
+|     |                 | data        | Any    | Node          |
+| 1   | countNodes      | head        | Node   |               |
++-----+-----------------+-------------+--------+---------------+
+```
+
+## Using `codetext` as Python module
+### Build your language
+`codetext` needs a tree-sitter language file (i.e. a `.so` file) to work properly. You can compile the language manually ([see more](https://github.com/tree-sitter/py-tree-sitter#usage)) or build it automatically with our pre-defined function (the `<language>.so` file will be saved in a folder named `/tree-sitter/`):
 ```python
 from codetext.utils import build_language
 
 language = 'rust'
 build_language(language)
 
-
 # INFO:utils:Not found tree-sitter-rust, attempt clone from github
 # Cloning into 'tree-sitter-rust'...
 # remote: Enumerating objects: 2835, done. ...
 # INFO:utils:Attempt to build Tree-sitter Language for rust and store in .../tree-sitter/rust.so
 ```
 
-## Language Parser
-We supported 10 programming languages, namely `Python`, `Java`, `JavaScript`, `Golang`, `Ruby`, `PHP`, `C#`, `C++`, `C` and `Rust`.
+### Using Language Parser
+Each supported programming language corresponds to a custom `language_parser` (e.g. Python uses [`PythonParser()`](src/codetext/parser/python_parser.py#L11)). A `language_parser` takes raw source code as input and uses breadth-first search to traverse all syntax nodes. Classes, methods, and stand-alone functions are then collected:
 
-Setup
 ```python
 from codetext.utils import parse_code
 
 raw_code = """
-/**
-* Sum of 2 number
-* @param a int number
-* @param b int number
-*/
-double sum2num(int a, int b) {
-    return a + b;
-}
+    /**
+    * Sum of 2 number
+    * @param a int number
+    * @param b int number
+    */
+    double sum2num(int a, int b) {
+        return a + b;
+    } 
 """
 
+# Auto parse code into tree-sitter.Tree
 root = parse_code(raw_code, 'cpp')
 root_node = root.root_node
 ```
 
-Get all function nodes inside a specific node, use:
+Get all function nodes inside a specific node:
 ```python
 from codetext.utils.parser import CppParser
 
@@ -105,3 +178,9 @@ class_list = CppParser.get_class_list(root_node)
 # and
 metadata = CppParser.get_metadata_list(root_node)
 ```
+
+# Limitations
+`codetext` depends heavily on tree-sitter syntax:
+- We use tree-sitter grammars to extract the desired nodes, such as functions, classes, function names (identifiers), or class argument lists, so `codetext` can easily break when a tree-sitter update changes the grammar or syntax.
+
+- While we try our best to capture every possibility, many cases remain uncovered. We welcome community contributions to this project.
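Note on the README's "Using Language Parser" section: the breadth-first collection it describes can be sketched roughly as follows. This is a minimal illustration, not the code in `codetext.parser`. The `Node` dataclass and the `collect_nodes_by_kind` helper are hypothetical stand-ins, chosen only because tree-sitter nodes also expose `type` and `children`; the node type names are examples.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a syntax node with `type` and `children`."""
    type: str
    children: List["Node"] = field(default_factory=list)


def collect_nodes_by_kind(root: Node, kinds: List[str]) -> List[Node]:
    """Visit the tree level by level and keep nodes whose type is in `kinds`."""
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.type in kinds:
            found.append(node)
        # Children are appended at the back, so shallower nodes are visited first.
        queue.extend(node.children)
    return found


# Toy tree: one top-level function and one method nested inside a class.
tree = Node("translation_unit", [
    Node("comment"),
    Node("function_definition", [Node("identifier")]),
    Node("class_specifier", [Node("function_definition")]),
])
functions = collect_nodes_by_kind(tree, ["function_definition"])
```

Because the queue is processed in insertion order, the top-level function is returned before the method nested inside the class.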
diff --git a/asset/img/codetext_logo.png b/asset/img/codetext_logo.png
diff --git a/asset/img/codetext_logo_line.png b/asset/img/codetext_logo_line.png
diff --git a/pyproject.toml b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "codetext"
-version = "0.0.5"
+version = "0.0.7"
 authors = [
   { name="Dung Manh Nguyen", email="[email protected]" },
 ]
@@ -21,8 +21,12 @@ dependencies = [
     "Levenshtein>=0.20",
     "langdetect>=1.0.0",
     "bs4>=0.0.1",
+    "tabulate>=0.9.0"
 ]
 
 [project.urls]
 "Homepage" = "https://github.com/AI4Code-Research/CodeText-data"
 "Bug Tracker" = "https://github.com/AI4Code-Research/CodeText-data/issues"
+
+[project.scripts]
+codetext = "codetext.__main__:main"
diff --git a/requirements.txt b/requirements.txt
@@ -1,6 +1,6 @@
 # for preprocessing
 tree-sitter
-# docstring-parser
+tabulate
 Levenshtein
 langdetect
 bs4
diff --git a/src/codetext/__init__.py b/src/codetext/__init__.py
diff --git a/src/codetext/__main__.py b/src/codetext/__main__.py
@@ -0,0 +1,93 @@
+import os
+import sys
+import argparse
+import pkg_resources
+
+import json
+from .codetext_cli import parse_file, print_result, PL_MATCHING
+
+
+def get_args():
+    parser = argparse.ArgumentParser(description=f"codetext parser {20*'='}")
+
+    parser.add_argument('paths', nargs='*', default=['.'],
+                        help='list of the filename/paths.')
+    parser.add_argument("--version", action="version",
+                        version=pkg_resources.get_distribution("codetext").version)
+    parser.add_argument("-l", "--language",
+                        help='''Target the programming languages you want to
+                        analyze.''')
+    parser.add_argument("-o", "--output_file",
+                        help='''Output file (e.g report.json).
+                        ''',
+                        type=str)
+    parser.add_argument("--json",
+                        help='''Generate json output as a transform of the
+                        default output''',
+                        action="store_true")
+    parser.add_argument("--verbose",
+                        help='''Print progress bar''',
+                        action="store_true")
+
+    return parser.parse_args()
+
+
+def main():
+    opt = get_args()
+
+    # check args
+    if opt.json:
+        if not opt.output_file: 
+            raise ValueError("Missing --output_file")
+    if opt.language:
+        if opt.language not in PL_MATCHING.keys():
+            raise ValueError(
+                "{language} not supported. Currently support {sp_language}"
+                .format(language=opt.language, 
+                        sp_language=list(PL_MATCHING.keys())))
+
+    # Collect candidate files from every given path (not just the last one)
+    files = []
+    for path in opt.paths:
+        if not os.path.exists(path):
+            raise ValueError("Path does not exist: {path}".format(path=path))
+        if os.path.isdir(path):
+            files += [os.path.join(path, f) for f in os.listdir(path)
+                      if os.path.isfile(os.path.join(path, f))]
+        else:
+            files.append(path)
+
+    # Keep only files whose extension matches the requested language
+    if opt.language:
+        files = [f for f in files
+                 if os.path.splitext(f)[1] in PL_MATCHING[opt.language]]
+
+    output_metadata = {}
+    for file in files:
+        filename, file_extension = os.path.splitext(file)
+
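+        language = opt.language
+        if language is None:
+            # Infer the language from the file extension
+            language = next((lang for lang, ext_list in PL_MATCHING.items()
+                             if file_extension in ext_list), None)
+        if language is None:
+            continue  # skip files with an unrecognized extension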
+
+        output = parse_file(file, language=language)
+        print_result(
+            output, 
+            file_name=str(filename).split(os.sep)[-1]+file_extension
+        )
+        output_metadata[file] = output
+
+    if opt.json:
+        save_path = opt.output_file
+        with open(save_path, 'w') as output_file:
+            json.dump(output_metadata, output_file, sort_keys=True, indent=4)
+            print(50*'=')
+            print("Save report to {path}".format(path=save_path))
+
+
+if __name__ == '__main__':
+    main()
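Note on `main()`: the language lookup it performs per file can be isolated into a small helper, which makes the "no matching extension" case explicit. The sketch below is an assumption-laden illustration: `EXTENSIONS` is a hypothetical subset standing in for `PL_MATCHING` (whose real contents live in `codetext_cli` and are not shown in this diff), and `resolve_language` is not a function in the package.

```python
import os
from typing import Dict, List, Optional

# Hypothetical subset; the real PL_MATCHING in codetext_cli may differ.
EXTENSIONS: Dict[str, List[str]] = {
    "Python": [".py"],
    "Java": [".java"],
    "C++": [".cpp", ".hpp"],
}


def resolve_language(path: str, requested: Optional[str] = None) -> Optional[str]:
    """Return the requested language, or infer one from the file extension."""
    if requested is not None:
        return requested
    extension = os.path.splitext(path)[1]
    for language, extensions in EXTENSIONS.items():
        if extension in extensions:
            return language
    # No known extension: the caller decides whether to skip or report the file.
    return None
```

Returning `None` instead of leaving the variable unset lets the caller skip unsupported files rather than fail or reuse the language of the previous file.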
diff --git a/src/codetext/clean/__init__.py b/src/codetext/clean/__init__.py
diff --git a/src/codetext/clean/noise_removal.py b/src/codetext/clean/noise_removal.py