diff --git a/docs/api/augment.rst b/docs/api/augment.rst
index 220cc21c8..bff34aed3 100644
--- a/docs/api/augment.rst
+++ b/docs/api/augment.rst
@@ -1,25 +1,69 @@
 .. currentmodule:: pythainlp.augment
-pythainlp.augment
-=================
+pythainlp.augment Module
+=======================
-The :class:`textaugment` is Thai text augment. This function for text augment task.
+Introduction
+------------
-Modules
--------
+The `pythainlp.augment` module is a powerful toolset for text augmentation in the Thai language. Text augmentation is a process that enriches and diversifies textual data by generating alternative versions of the original text. This module is a valuable resource for improving the quality and variety of Thai language data for NLP tasks.
+
+TextAugment Class
+-----------------
+
+The central component of the `pythainlp.augment` module is the `TextAugment` class. This class provides various text augmentation techniques and functions to enhance the diversity of your text data. It offers the following methods:
+
+.. autoclass:: pythainlp.augment.TextAugment
+ :members:
+
+WordNetAug Class
+----------------
+
+The `WordNetAug` class is designed to perform text augmentation using WordNet, a lexical database for English. This class enables you to augment Thai text using English synonyms, offering a unique approach to text diversification. The following methods are available within this class:
+
+.. autoclass:: pythainlp.augment.WordNetAug
+ :members:
+
+Word2VecAug, Thai2fitAug, LTW2VAug Classes
+------------------------------------------
+
+The `pythainlp.augment.word2vec` package contains multiple classes for text augmentation using Word2Vec models. These classes include `Word2VecAug`, `Thai2fitAug`, and `LTW2VAug`. Each of these classes allows you to use Word2Vec embeddings to generate text variations. Explore the methods provided by these classes to understand their capabilities.
-.. autoclass:: WordNetAug
- :members:
-.. autofunction:: postype2wordnet
 .. autoclass:: pythainlp.augment.word2vec.Word2VecAug
- :members:
+ :members:
+
 .. autoclass:: pythainlp.augment.word2vec.Thai2fitAug
- :members:
+ :members:
+
 .. autoclass:: pythainlp.augment.word2vec.LTW2VAug
- :members:
+ :members:
+
+FastTextAug and Thai2transformersAug Classes
+--------------------------------------------
+
+The `pythainlp.augment.lm` package offers classes for text augmentation using language models. These classes include `FastTextAug` and `Thai2transformersAug`. These classes allow you to use language model-based techniques to diversify text data. Explore their methods to understand their capabilities.
+
 .. autoclass:: pythainlp.augment.lm.FastTextAug
- :members:
+ :members:
+
 .. autoclass:: pythainlp.augment.lm.Thai2transformersAug
- :members:
+ :members:
+
+BPEmbAug Class
+--------------
+
+The `pythainlp.augment.word2vec.bpemb_wv` package contains the `BPEmbAug` class, which is designed for text augmentation using subword embeddings. This class is particularly useful when working with subword representations for Thai text augmentation.
+
 .. autoclass:: pythainlp.augment.word2vec.bpemb_wv.BPEmbAug
- :members:
\ No newline at end of file
+ :members:
+
+Additional Functions
+-------------------
+
+To further enhance your text augmentation tasks, the `pythainlp.augment` module offers the following functions:
+
+- `postype2wordnet`: This function maps part-of-speech tags to WordNet-compatible POS tags, facilitating the integration of WordNet augmentation with Thai text.
+
+These functions and classes provide diverse techniques for text augmentation in the Thai language, making this module a valuable asset for NLP researchers, developers, and practitioners.
+
+For detailed usage examples and guidelines, please refer to the official PyThaiNLP documentation. The `pythainlp.augment` module opens up new possibilities for enriching and diversifying Thai text data, leading to improved NLP models and applications.
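The augmentation idea described in the new `augment.rst` text (producing alternative versions of a sentence by swapping words for synonyms) can be illustrated with a small standalone sketch. It does not call PyThaiNLP: the `SYNONYMS` table, the `augment` helper and its `max_variants` parameter are hypothetical names chosen for illustration, and the input is assumed to be already tokenized.

```python
from itertools import product
from typing import Dict, List

# Hypothetical toy synonym table (an assumption for this sketch); a real
# augmenter would look candidates up in a lexical resource instead.
SYNONYMS: Dict[str, List[str]] = {
    "ชอบ": ["รัก"],
    "โรงเรียน": ["สถานศึกษา"],
}


def augment(tokens: List[str], max_variants: int = 6) -> List[List[str]]:
    """Return up to max_variants token lists with words swapped for synonyms."""
    # For every token, the candidates are the token itself plus its synonyms.
    choices = [[tok] + SYNONYMS.get(tok, []) for tok in tokens]
    variants: List[List[str]] = []
    for combo in product(*choices):
        candidate = list(combo)
        if candidate != tokens:  # skip the unchanged original sentence
            variants.append(candidate)
        if len(variants) >= max_variants:
            break
    return variants


tokens = ["เรา", "ชอบ", "ไป", "โรงเรียน"]
for variant in augment(tokens):
    print("".join(variant))
```

Enumerating the Cartesian product keeps the sketch short, but the number of candidates grows quickly with sentence length, which is why the cap on returned variants is applied inside the loop.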
diff --git a/docs/api/benchmarks.rst b/docs/api/benchmarks.rst
index 418e53b6f..bf9e6047a 100644
--- a/docs/api/benchmarks.rst
+++ b/docs/api/benchmarks.rst
@@ -2,23 +2,43 @@
 pythainlp.benchmarks
 ====================================
-The :class:`pythainlp.benchmarks` contains utility functions for benchmarking
-tasked related to Thai NLP. At the moment, we have only for word tokenization.
-Other tasks will be added soon.
-Modules
--------
+Introduction
+------------
+
+The `pythainlp.benchmarks` module is a collection of utility functions designed for benchmarking tasks related to Thai Natural Language Processing (NLP). Currently, the module includes tools for word tokenization benchmarking. Please note that additional benchmarking tasks will be incorporated in the future.
 Tokenization
-*********
+------------
+
+Word tokenization is a fundamental task in NLP, and it plays a crucial role in various applications, such as text analysis and language processing. The `pythainlp.benchmarks` module offers a set of functions to assist in the benchmarking and evaluation of word tokenization methods.
+
+Quality Evaluation
+^^^^^^^^^^^^^^^^^^
+
+The quality of word tokenization can significantly impact the accuracy of downstream NLP tasks. To assess the quality of word tokenization, the module provides a qualitative evaluation using various metrics and techniques.
-Quality
-^^^^
 .. figure:: ../images/evaluation.png
  :scale: 50 %
  Qualitative evaluation of word tokenization.
+Functions
+---------
+
 .. autofunction:: pythainlp.benchmarks.word_tokenization.compute_stats
+
+ This function is used to compute various statistics and metrics related to word tokenization. It allows you to assess the performance of different tokenization methods.
+
 .. autofunction:: pythainlp.benchmarks.word_tokenization.benchmark
+
+ The `benchmark` function facilitates the benchmarking of word tokenization methods. It provides an organized framework for evaluating and comparing the effectiveness of different tokenization tools.
+
 .. autofunction:: pythainlp.benchmarks.word_tokenization.preprocessing
+
+ Preprocessing is a crucial step in NLP tasks. The `preprocessing` function assists in preparing text data for tokenization, which is essential for accurate and consistent benchmarking.
+
+Usage
+-----
+
+To make use of these benchmarking functions, you can follow the provided examples and guidelines in the official PyThaiNLP documentation. These tools are invaluable for researchers, developers, and anyone interested in improving and evaluating Thai word tokenization methods.
diff --git a/docs/api/coref.rst b/docs/api/coref.rst
index daf5690bc..9a786364e 100644
--- a/docs/api/coref.rst
+++ b/docs/api/coref.rst
@@ -2,9 +2,37 @@
 pythainlp.coref
 ===============
-The :class:`pythainlp.coref` is Coreference Resolution for Thai.
+Introduction
+------------
+
+The `pythainlp.coref` module is dedicated to Coreference Resolution for the Thai language. Coreference resolution is a crucial task in natural language processing (NLP) that deals with identifying and linking expressions (such as pronouns) in a text to the entities or concepts they refer to. This module provides tools to tackle coreference resolution challenges in the context of the Thai language.
-Modules
--------
+Coreference Resolution Function
+-------------------------------
+
+The primary component of the `pythainlp.coref` module is the `coreference_resolution` function. This function is designed to analyze text and identify instances of coreference, helping NLP systems understand when different expressions in the text refer to the same entity. Here's how you can use it:
+
+The :class:`pythainlp.coref` is Coreference Resolution for Thai.
 .. autofunction:: coreference_resolution
+
+Usage
+-----
+
+To use the `coreference_resolution` function effectively, follow these steps:
+
+1. Import the `coreference_resolution` function from the `pythainlp.coref` module.
+
+2. Pass the Thai text you want to analyze for coreferences as input to the function.
+
+3. The function will process the text and return information about coreference relationships within the text.
+
+Example:
+
+```python
+from pythainlp.coref import coreference_resolution
+
+text = "นาย A มาจาก กรุงเทพ และเขา มีความรักต่อ บางกิจ ของเขา"
+coreferences = coreference_resolution(text)
+
+print(coreferences)
diff --git a/docs/api/corpus.rst b/docs/api/corpus.rst
index b68ffacc3..6c5dbf72c 100644
--- a/docs/api/corpus.rst
+++ b/docs/api/corpus.rst
@@ -2,90 +2,280 @@
 pythainlp.corpus
 ====================================
-The :class:`pythainlp.corpus` provides access to corpus that comes with PyThaiNLP.
+The :class:`pythainlp.corpus` module provides access to various Thai language corpora and resources that come bundled with PyThaiNLP. These resources are essential for natural language processing tasks in the Thai language.
 Modules
 -------
+countries
+~~~~~~~~~~
 .. autofunction:: countries
+ :noindex:
+
+get_corpus
+~~~~~~~~~~
 .. autofunction:: get_corpus
+ :noindex:
+
+get_corpus_db
+~~~~~~~~~~~~~~
 .. autofunction:: get_corpus_db
+ :noindex:
+
+get_corpus_db_detail
+~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: get_corpus_db_detail
+ :noindex:
+
+get_corpus_default_db
+~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: get_corpus_default_db
+ :noindex:
+
+get_corpus_path
+~~~~~~~~~~~~~~
 .. autofunction:: get_corpus_path
+ :noindex:
+
+download
+~~~~~~~~~~
 .. autofunction:: download
+ :noindex:
+
+remove
+~~~~~~~
 .. autofunction:: remove
+ :noindex:
+
+provinces
+~~~~~~~~~~
 .. autofunction:: provinces
+ :noindex:
+
+thai_dict
+~~~~~~~~~~
 .. autofunction:: thai_dict
+ :noindex:
+
+thai_stopwords
+~~~~~~~~~~~~~~
 .. autofunction:: thai_stopwords
+ :noindex:
+
+thai_words
+~~~~~~~~~~
 .. autofunction:: thai_words
+ :noindex:
+
+thai_wsd_dict
+~~~~~~~~~~~~~~
 .. autofunction:: thai_wsd_dict
+ :noindex:
+
+thai_orst_words
+~~~~~~~~~~~~~~~~~
 .. autofunction:: thai_orst_words
+ :noindex:
+
+thai_synonym
+~~~~~~~~~~~~~~
 .. autofunction:: thai_synonym
+ :noindex:
+
+thai_syllables
+~~~~~~~~~~~~~~
 .. autofunction:: thai_syllables
+ :noindex:
+
+thai_negations
+~~~~~~~~~~~~~~
 .. autofunction:: thai_negations
+ :noindex:
+
+thai_family_names
+~~~~~~~~~~~~~~~~~~~
 .. autofunction:: thai_family_names
+ :noindex:
+
+thai_female_names
+~~~~~~~~~~~~~~~~~~~
 .. autofunction:: thai_female_names
+ :noindex:
+
+thai_male_names
+~~~~~~~~~~~~~~~~
 .. autofunction:: thai_male_names
+ :noindex:
+
+pythainlp.corpus.th_en_translit.get_transliteration_dict
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.th_en_translit.get_transliteration_dict
+ :noindex:
 ConceptNet
 ----------
-ConceptNet is an open, multilingual knowledge graph
-See: https://github.com/commonsense/conceptnet5/wiki/API
+ConceptNet is an open, multilingual knowledge graph used for various natural language understanding tasks. For more information, refer to the `ConceptNet documentation `_.
+pythainlp.corpus.conceptnet.edges
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.conceptnet.edges
+ :noindex:
-TNC
+TNC (Thai National Corpus)
 ---
+The Thai National Corpus (TNC) is a collection of text data in the Thai language. This module provides access to word frequency data from the TNC corpus.
+
+pythainlp.corpus.tnc.word_freqs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.tnc.word_freqs
+ :noindex:
+
+pythainlp.corpus.tnc.unigram_word_freqs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.tnc.unigram_word_freqs
+ :noindex:
+
+pythainlp.corpus.tnc.bigram_word_freqs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.tnc.bigram_word_freqs
+ :noindex:
+
+pythainlp.corpus.tnc.trigram_word_freqs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.tnc.trigram_word_freqs
+ :noindex:
-TTC
+TTC (Thai Textbook Corpus)
 ---
+The Thai Textbook Corpus (TTC) is a collection of Thai language text data, primarily sourced from textbooks.
+
+pythainlp.corpus.ttc.word_freqs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.ttc.word_freqs
+ :noindex:
+
+pythainlp.corpus.ttc.unigram_word_freqs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.ttc.unigram_word_freqs
+ :noindex:
 OSCAR
 -----
+OSCAR is a multilingual corpus that includes Thai text data. This module provides access to word frequency data from the OSCAR corpus.
+
+pythainlp.corpus.oscar.word_freqs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.oscar.word_freqs
+ :noindex:
+
+pythainlp.corpus.oscar.unigram_word_freqs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.oscar.unigram_word_freqs
+ :noindex:
 Util
 ----
+Utilities for working with the corpus data.
+
+pythainlp.corpus.util.find_badwords
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.util.find_badwords
+ :noindex:
+
+pythainlp.corpus.util.revise_wordset
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.util.revise_wordset
+ :noindex:
+
+pythainlp.corpus.util.revise_newmm_default_wordset
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.util.revise_newmm_default_wordset
+ :noindex:
 WordNet
 -------
-PyThaiNLP API is an exact copy of NLTK WordNet API.
-See: https://www.nltk.org/howto/wordnet.html
+PyThaiNLP API includes the WordNet module, which is an exact copy of NLTK's WordNet API for the Thai language. WordNet is a lexical database for English and other languages.
+
+For more details on WordNet, refer to the `NLTK WordNet documentation `_.
+pythainlp.corpus.wordnet.synsets
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.synsets
+ :noindex:
+
+pythainlp.corpus.wordnet.synset
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.synset
+ :noindex:
+
+pythainlp.corpus.wordnet.all_lemma_names
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.all_lemma_names
+ :noindex:
+
+pythainlp.corpus.wordnet.all_synsets
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.all_synsets
+ :noindex:
+
+pythainlp.corpus.wordnet.langs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.langs
+ :noindex:
+
+pythainlp.corpus.wordnet.lemmas
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.lemmas
+ :noindex:
+
+pythainlp.corpus.wordnet.lemma
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.lemma
+ :noindex:
+
+pythainlp.corpus.wordnet.lemma_from_key
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.lemma_from_key
+ :noindex:
+
+pythainlp.corpus.wordnet.path_similarity
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.path_similarity
+ :noindex:
+
+pythainlp.corpus.wordnet.lch_similarity
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.lch_similarity
+ :noindex:
+
+pythainlp.corpus.wordnet.wup_similarity
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.wup_similarity
+ :noindex:
+
+pythainlp.corpus.wordnet.morphy
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.morphy
+ :noindex:
+
+pythainlp.corpus.wordnet.custom_lemmas
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.corpus.wordnet.custom_lemmas
+ :noindex:
 Definition
 ++++++++++
 Synset
- a set of synonyms that share a common meaning.
+~~~~~~~
+A synset is a set of synonyms that share a common meaning. The WordNet module provides functionality to work with these synsets.
+
+This documentation is designed to help you navigate and use the various resources and modules available in the `pythainlp.corpus` package effectively. If you have any questions or need further assistance, please refer to the PyThaiNLP documentation or reach out to the PyThaiNLP community for support.
+
+We hope you find this documentation helpful for your natural language processing tasks in the Thai language.
diff --git a/docs/api/el.rst b/docs/api/el.rst
index bd88abc15..36d24d1bf 100644
--- a/docs/api/el.rst
+++ b/docs/api/el.rst
@@ -2,7 +2,53 @@
 pythainlp.el
 ============
-The :class:`pythainlp.el` is Thai Entity Linking with PyThaiNLP.
+The :class:`pythainlp.el` module is an essential component of Thai Entity Linking within the PyThaiNLP library. Entity Linking is a key natural language processing task that associates mentions in text with corresponding entities in a knowledge base.
 .. autoclass:: EntityLinker
  :members:
+
+EntityLinker
+------------
+
+The :class:`EntityLinker` class is the core component of the `pythainlp.el` module, responsible for Thai Entity Linking. Entity Linking, also known as Named Entity Linking (NEL), plays a critical role in various applications, including question answering, information retrieval, and knowledge graph construction.
+
+Attributes and Methods
+~~~~~~~~~~~~~~~~~~~~~~
+
+The `EntityLinker` class offers the following attributes and methods:
+
+- `__init__(text, engine="default")`
+ - The constructor for the `EntityLinker` class. It takes the input `text` and an optional `engine` parameter to specify the entity linking engine. The default engine is used if no specific engine is provided.
+
+- `link()`
+ - The `link` method performs entity linking on the input text using the specified engine. It returns a list of entities linked in the text, along with their relevant information.
+
+- `set_engine(engine)`
+ - The `set_engine` method allows you to change the entity linking engine during runtime. This provides flexibility in selecting different engines for entity linking based on your specific requirements.
+
+- `get_linked_entities()`
+ - The `get_linked_entities` method retrieves a list of linked entities from the last entity linking operation. This is useful for extracting the entities found in the text.
+
+Usage
+~~~~~
+
+To use the `EntityLinker` class for entity linking, follow these steps:
+
+1. Initialize an `EntityLinker` object with the input text and, optionally, specify the engine.
+
+2. Call the `link` method to perform entity linking on the text.
+
+3. Utilize the `get_linked_entities` method to access the linked entities found in the text.
+
+Example
+~~~~~~~
+
+Here's a simple example of how to use the `EntityLinker` class:
+
+```python
+from pythainlp.el import EntityLinker
+
+text = "Bangkok is the capital of Thailand."
+el = EntityLinker(text)
+linked_entities = el.link()
+print(linked_entities)
diff --git a/docs/api/generate.rst b/docs/api/generate.rst
index 910bba27d..d0c80580a 100644
--- a/docs/api/generate.rst
+++ b/docs/api/generate.rst
@@ -2,17 +2,71 @@
 pythainlp.generate
 ==================
-The :class:`pythainlp.generate` is Thai text generate with PyThaiNLP.
+The :class:`pythainlp.generate` module is a powerful tool for generating Thai text using PyThaiNLP. It includes several classes and functions that enable users to create text based on various language models and n-gram models.
 Modules
 -------
+Unigram
+~~~~~~~
 .. autoclass:: Unigram
- :members:
+ :members:
+
+The :class:`Unigram` class provides functionality for generating text based on unigram language models. Unigrams are single words or tokens, and this class allows you to create text by selecting words probabilistically based on their frequencies in the training data.
+
+Bigram
+~~~~~~
 .. autoclass:: Bigram
- :members:
+ :members:
+
+The :class:`Bigram` class is designed for generating text using bigram language models. Bigrams are sequences of two words, and this class enables you to generate text by predicting the next word based on the previous word's probability.
+
+Trigram
+~~~~~~~
 .. autoclass:: Trigram
- :members:
+ :members:
+
+The :class:`Trigram` class extends text generation to trigram language models. Trigrams consist of three consecutive words, and this class facilitates the creation of text by predicting the next word based on the two preceding words' probabilities.
+
+pythainlp.generate.thai2fit.gen_sentence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.generate.thai2fit.gen_sentence
+ :noindex:
+
+The function :func:`pythainlp.generate.thai2fit.gen_sentence` offers a convenient way to generate sentences using the Thai2Vec language model. It takes a seed text as input and generates a coherent sentence based on the provided context.
+
+pythainlp.generate.wangchanglm.WangChanGLM
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autoclass:: pythainlp.generate.wangchanglm.WangChanGLM
- :members:
\ No newline at end of file
+ :members:
+
+The :class:`WangChanGLM` class is a part of the `pythainlp.generate.wangchanglm` module, offering text generation capabilities. It includes methods for creating text using the WangChanGLM language model.
+
+Usage
+~~~~~
+
+To use the text generation capabilities provided by the `pythainlp.generate` module, follow these steps:
+
+1. Select the appropriate class or function based on the type of language model you want to use (Unigram, Bigram, Trigram, Thai2Vec, or WangChanGLM).
+
+2. Initialize the selected class or use the function with the necessary parameters.
+
+3. Call the appropriate methods to generate text based on the chosen model.
+
+4. Utilize the generated text for various applications, such as chatbots, content generation, and more.
+
+Example
+~~~~~~~
+
+Here's a simple example of how to generate text using the `Unigram` class:
+
+```python
+from pythainlp.generate import Unigram
+
+# Initialize the Unigram model
+unigram = Unigram()
+
+# Generate a sentence
+sentence = unigram.gen_sentence(seed="สวัสดีครับ")
+
+print(sentence)
diff --git a/docs/api/khavee.rst b/docs/api/khavee.rst
index 71983bcd1..591ec79fd 100644
--- a/docs/api/khavee.rst
+++ b/docs/api/khavee.rst
@@ -2,11 +2,62 @@
 pythainlp.khavee
 ================
-The :class:`pythainlp.khavee` is toolkit for Thai Poetry. `khavee` is `กวี` (or Poetry) in Thai language.
+The :class:`pythainlp.khavee` module is a powerful toolkit designed for working with Thai poetry. The term "khavee" corresponds to "กวี" in the Thai language, which translates to "Poetry" in English. This toolkit equips users with the tools and utilities necessary for the creation, analysis, and verification of Thai poetry.
 Modules
 -------
+KhaveeVerifier
+~~~~~~~~~~~~~~
 .. autoclass:: KhaveeVerifier
  :special-members:
  :members:
+
+The :class:`KhaveeVerifier` class is the primary component of the `pythainlp.khavee` module, dedicated to the verification of Thai poetry. It offers a range of functions and methods for analyzing and validating Thai poetry, ensuring its adherence to the rules and structure of classical Thai poetic forms.
+
+Attributes and Methods
+~~~~~~~~~~~~~~~~~~~~~~
+
+The `KhaveeVerifier` class provides a variety of attributes and methods to facilitate the verification of Thai poetry. Some of its key features include:
+
+- `__init__(rules: dict = None, stanza_rules: dict = None, verbose: bool = False)`
+ - The constructor for the `KhaveeVerifier` class, allowing you to initialize an instance with custom rules, stanza rules, and verbosity settings.
+
+- `is_khavee(text: str, rules: dict = None)`
+ - The `is_khavee` method checks whether a given text conforms to the rules of Thai poetry. It returns `True` if the text is a valid Thai poem according to the specified rules, and `False` otherwise.
+
+- `get_rules()`
+ - The `get_rules` method retrieves the current set of rules being used by the verifier. This is helpful for inspecting and modifying the rules during runtime.
+
+- `set_rules(rules: dict)`
+ - The `set_rules` method allows you to set custom rules for the verifier, offering flexibility in defining specific constraints for Thai poetry.
+
+Usage
+~~~~~
+
+To use the `KhaveeVerifier` class for Thai poetry verification, follow these steps:
+
+1. Initialize an instance of the `KhaveeVerifier` class, optionally specifying custom rules and verbosity settings.
+
+2. Use the `is_khavee` method to verify whether a given text adheres to the rules of Thai poetry. The method returns a Boolean value indicating the result.
+
+3. Utilize the `get_rules` and `set_rules` methods to inspect and modify the rules as needed.
+
+Example
+~~~~~~~
+
+Here's a basic example of how to use the `KhaveeVerifier` class to verify Thai poetry:
+
+```python
+from pythainlp.khavee import KhaveeVerifier
+
+# Initialize a KhaveeVerifier instance
+verifier = KhaveeVerifier()
+
+# Text to verify
+poem_text = "ดอกไม้สวยงาม แสนสดใส"
+
+# Verify if the text is Thai poetry
+is_poetry = verifier.is_khavee(poem_text)
+
+print(f"The provided text is Thai poetry: {is_poetry}")
diff --git a/docs/api/parse.rst b/docs/api/parse.rst
index db1ea47b6..93bb4d552 100644
--- a/docs/api/parse.rst
+++ b/docs/api/parse.rst
@@ -2,9 +2,39 @@
 pythainlp.parse
 ===============
-The :class:`pythainlp.parse` is dependency parsing for Thai.
+The :class:`pythainlp.parse` module provides dependency parsing for the Thai language. Dependency parsing is a fundamental task in natural language processing that involves identifying the grammatical relationships between words in a sentence, which helps to analyze sentence structure and meaning.
 Modules
 -------
+dependency_parsing
+~~~~~~~~~~~~~~~~~
 .. autofunction:: dependency_parsing
+
+The `dependency_parsing` function is the core component of the `pythainlp.parse` module. It offers dependency parsing capabilities for the Thai language. Given a Thai sentence as input, this function parses the sentence to identify the grammatical relationships between words, creating a dependency tree that represents the sentence's structure.
+
+Usage
+~~~~~
+
+To use the `dependency_parsing` function for Thai dependency parsing, follow these steps:
+
+1. Import the `pythainlp.parse` module.
+2. Use the `dependency_parsing` function with a Thai sentence as input.
+3. The function will return the dependency parsing results, which include information about the grammatical relationships between words.
+
+Example
+~~~~~~~
+
+Here's a basic example of how to use the `dependency_parsing` function:
+
+```python
+from pythainlp.parse import dependency_parsing
+
+# Input Thai sentence
+sentence = "พี่น้องชาวบ้านกำลังเลี้ยงสตางค์ในสวน"
+
+# Perform dependency parsing
+parsing_result = dependency_parsing(sentence)
+
+# Print the parsing result
+print(parsing_result)
diff --git a/docs/api/soundex.rst b/docs/api/soundex.rst
index 139fadd02..66ae95e07 100644
--- a/docs/api/soundex.rst
+++ b/docs/api/soundex.rst
@@ -1,31 +1,69 @@
 .. currentmodule:: pythainlp.soundex
 pythainlp.soundex
-====================================
-The :class:`pythainlp.soundex` is soundex for Thai.
+================
+The :class:`pythainlp.soundex` module provides soundex algorithms for the Thai language. Soundex is a phonetic algorithm used to encode words or names into a standardized representation based on their pronunciation, making it useful for tasks like name matching and search.
 Modules
 -------
+soundex
+~~~~~~~
 .. autofunction:: soundex
+
+The `soundex` function is a basic Soundex algorithm for the Thai language. It encodes a Thai word into a Soundex code, allowing for approximate matching of words with similar pronunciation.
+
+lk82
+~~~~
 .. autofunction:: lk82
+
+The `lk82` module implements the Thai Soundex algorithm proposed by Vichit Lorjai in 1982. This module is suitable for encoding Thai words into Soundex codes for phonetic comparisons.
+
+udom83
+~~~~~~
 .. autofunction:: udom83
+
+The `udom83` module is based on a homonymic approach for sound-alike string search. It encodes Thai words using the Udompanich Soundex algorithm developed in 1983.
+
+metasound
+~~~~~~~~~
 .. autofunction:: metasound
+
+The `metasound` module implements a novel phonetic name matching algorithm with a statistical ontology for analyzing names based on Thai astrology. It offers advanced phonetic matching capabilities for Thai names.
+
+prayut_and_somchaip
+~~~~~~~~~~~~~~~~~~~
 .. autofunction:: prayut_and_somchaip
+
+The `prayut_and_somchaip` module is designed for Thai-English cross-language transliterated word retrieval using the Soundex technique. It is particularly useful for matching transliterated words in both languages.
+
+pythainlp.soundex.sound.word_approximation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.soundex.sound.word_approximation
+
+The `pythainlp.soundex.sound.word_approximation` module offers word approximation functionality. It allows users to find Thai words that are phonetically similar to a given word.
+
+pythainlp.soundex.sound.audio_vector
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.soundex.sound.audio_vector
+
+The `pythainlp.soundex.sound.audio_vector` module provides audio vector functionality for Thai words. It allows users to work with audio vectors based on phonetic properties.
+
+pythainlp.soundex.sound.word2audio
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. autofunction:: pythainlp.soundex.sound.word2audio
+The `pythainlp.soundex.sound.word2audio` module is designed for converting Thai words to audio representations. It enables users to obtain audio vectors for Thai words, which can be used for various applications.
+
 References
 ----------
+.. [#metasound] Snae & Brückner. (2009). `Novel Phonetic Name Matching Algorithm with a Statistical Ontology for Analyzing Names Given in Accordance with Thai Astrology `_.
-.. [#metasound] Snae & Brückner. (2009). `Novel Phonetic Name Matching Algorithm with a Statistical
- Ontology for Analysing Names Given in Accordance with Thai Astrology `_.
-
-.. [#udom83] Wannee Udompanich (1983). Search Thai sound-alike string using homonymic approach.
- Master Thesis. Chulalongkorn University, Thailand.
+.. [#udom83] Wannee Udompanich (1983). Search Thai sound-alike string using homonymic approach. Master Thesis. Chulalongkorn University, Thailand.
 .. [#lk82] วิชิต หล่อจีระชุณห์กุล และ เจริญ คุวินทร์พันธุ์. `โปรแกรมการสืบค้นคำไทยตามเสียงอ่าน (Thai Soundex) `_.
-.. [#prayut_and_somchaip] Prayut Suwanvisat, Somchai Prasitjutrakul. Thai-English Cross-Language Transliterated Word Retrieval using Soundex Technique. In 1998 [cited 2022 Sep 8]. Available from: https://www.cp.eng.chula.ac.th/~somchai/spj/papers/ThaiText/ncsec98-clir.pdf
+.. [#prayut_and_somchaip] Prayut Suwanvisat, Somchai Prasitjutrakul. Thai-English Cross-Language Transliterated Word Retrieval using Soundex Technique. In 1998 [cited 2022 Sep 8]. Available from: https://www.cp.eng.chula.ac.th/~somchai/spj/papers/ThaiText/ncsec98-clir.pdf.
+
+This enhanced documentation provides clear descriptions of all the modules within the `pythainlp.soundex` module, including their purposes and functionalities. Users can now better understand how to leverage these soundex algorithms for various phonetic matching tasks in the Thai language.
diff --git a/docs/api/spell.rst b/docs/api/spell.rst
index cad3f7faf..c28fca95e 100644
--- a/docs/api/spell.rst
+++ b/docs/api/spell.rst
@@ -1,23 +1,54 @@
 .. currentmodule:: pythainlp.spell
 pythainlp.spell
-=====================================
-The :class:`pythainlp.spell` finds the closest correctly spelled word to the given text.
+===============
+The :class:`pythainlp.spell` module is a powerful tool for finding the closest correctly spelled word to a given text in the Thai language. It provides functionalities to correct spelling errors and enhance the accuracy of text processing.
 Modules
 -------
+correct
+~~~~~~~
 .. autofunction:: correct
+
+The `correct` function is designed to correct the spelling of a single Thai word. Given an input word, this function returns the closest correctly spelled word from the dictionary, making it valuable for spell-checking and text correction tasks.
+
+correct_sent
+~~~~~~~~~~~~
 .. autofunction:: correct_sent
+
+The `correct_sent` function is an extension of the `correct` function and is used to correct an entire sentence. It tokenizes the input sentence, corrects each word, and returns the corrected sentence. This is beneficial for proofreading and improving the readability of Thai text.
+
+spell
+~~~~~
 .. autofunction:: spell
+
+The `spell` function is responsible for identifying spelling errors within a given Thai word. It checks whether the input word is spelled correctly or not and returns a Boolean result. This function is useful for validating the correctness of Thai words.
+
+spell_sent
+~~~~~~~~~~
 .. autofunction:: spell_sent
+
+The `spell_sent` function extends the spell-checking functionality to entire sentences. It tokenizes the input sentence and checks the spelling of each word. It returns a list of Booleans indicating whether each word in the sentence is spelled correctly or not.
+
+NorvigSpellChecker
+~~~~~~~~~~~~~~~~~~
 .. autoclass:: NorvigSpellChecker
  :special-members:
  :members:
+
+The `NorvigSpellChecker` class is a fundamental component of the `pythainlp.spell` module. It implements a spell-checking algorithm based on the work of Peter Norvig. This class is designed for more advanced spell-checking and provides customizable settings for spell correction.
+
+DEFAULT_SPELL_CHECKER
+~~~~~~~~~~~~~~~~~~~~~
 .. autodata:: DEFAULT_SPELL_CHECKER
- :annotation: = Default instance of standard NorvigSpellChecker, using word list from Thai National Corpus: http://www.arts.chula.ac.th/ling/tnc/
+ :annotation: = Default instance of the standard NorvigSpellChecker, using word list data from the Thai National Corpus: http://www.arts.chula.ac.th/ling/tnc/
+
+The `DEFAULT_SPELL_CHECKER` is an instance of the `NorvigSpellChecker` class with default settings. It is pre-configured to use word list data from the Thai National Corpus, making it a reliable choice for general spell-checking tasks.
 References
 ----------
 .. [#norvig_spellchecker] Peter Norvig (2007). `How to Write a Spelling Corrector `_.
+
+This enhanced documentation provides a clear introduction to the `pythainlp.spell` module, its purpose, and the functionalities it offers for Thai text spell-checking. It also includes detailed descriptions of the functions and classes, their purposes, and how to use them effectively. Users can now understand how to leverage this module for spell-checking and text correction in the Thai language. If you have any questions or need further assistance, please refer to the PyThaiNLP documentation or reach out to the PyThaiNLP community for support.
diff --git a/docs/api/tokenize.rst b/docs/api/tokenize.rst
index 67c11b3d6..4dc9493e6 100644
--- a/docs/api/tokenize.rst
+++ b/docs/api/tokenize.rst
@@ -3,97 +3,173 @@
 pythainlp.tokenize
 =====================================
-The :class:`pythainlp.tokenize` contains multiple functions for tokenizing a chunk of Thai text into desirable units.
+The :mod:`pythainlp.tokenize` module contains a comprehensive set of functions and classes for tokenizing Thai text into various units, such as sentences, words, subwords, and more. This module is a fundamental component of the PyThaiNLP library, providing tools for natural language processing in the Thai language.
 Modules
 -------
 .. autofunction:: clause_tokenize
+ :noindex:
+
+ Tokenizes text into clauses. This function allows you to split text into meaningful sections, making it useful for more advanced text processing tasks.
+
 .. autofunction:: sent_tokenize
+ :noindex:
+
+ Splits Thai text into sentences. This function identifies sentence boundaries, which is essential for text segmentation and analysis.
+
 .. autofunction:: paragraph_tokenize
+ :noindex:
+
+ Segments text into paragraphs, which can be valuable for document-level analysis or summarization.
+
 .. autofunction:: subword_tokenize
+ :noindex:
+
+ Tokenizes text into subwords, which can be helpful for various NLP tasks, including subword embeddings.
+
 .. autofunction:: syllable_tokenize
+ :noindex:
+
+ Divides text into syllables, allowing you to work with individual Thai language phonetic units.
+
 .. autofunction:: word_tokenize
+ :noindex:
+
+ Splits text into words. This function is a fundamental tool for Thai language text analysis.
+
 .. autofunction:: word_detokenize
+ :noindex:
+
+ Reverses the tokenization process, reconstructing text from tokenized units. Useful for text generation tasks.
+
 .. autoclass:: Tokenizer
- :members:
+ :members:
+
+ The `Tokenizer` class is a versatile tool for customizing tokenization processes and managing tokenization models. It provides various methods and attributes to fine-tune tokenization according to your specific needs.
 Tokenization Engines
 --------------------
+This module offers multiple tokenization engines designed for different levels of text analysis.
+
 Sentence level
 --------------
-crfcut
-------
-.. automodule:: pythainlp.tokenize.crfcut
+**crfcut**
+
+.. automodule:: pythainlp.tokenize.crfcut
+ :members:
+
+ A tokenizer that operates at the sentence level using Conditional Random Fields (CRF). It is suitable for segmenting text into sentences accurately.
-thaisumcut
-----------
-.. automodule:: pythainlp.tokenize.thaisumcut
+**thaisumcut**
+
+..
automodule:: pythainlp.tokenize.thaisumcut + :members: + + A sentence tokenizer based on a maximum entropy model. It's a great choice for sentence boundary detection in Thai text. Word level ---------- -attacut -+++++++ -.. automodule:: pythainlp.tokenize.attacut - -deepcut -+++++++ -.. automodule:: pythainlp.tokenize.deepcut - -multi_cut -+++++++++ -.. automodule:: pythainlp.tokenize.multi_cut - -nlpo3 -+++++ -.. automodule:: pythainlp.tokenize.nlpo3 - -longest -+++++++ -.. automodule:: pythainlp.tokenize.longest - -pyicu -+++++ -.. automodule:: pythainlp.tokenize.pyicu - -nercut -++++++ -.. automodule:: pythainlp.tokenize.nercut - -sefr_cut -++++++++ -.. automodule:: pythainlp.tokenize.sefr_cut - -oskut -+++++ -.. automodule:: pythainlp.tokenize.oskut - -newmm -+++++ - -The default word tokenization engine. - -.. automodule:: pythainlp.tokenize.newmm - +**attacut** + +.. automodule:: pythainlp.tokenize.attacut + :members: + + A tokenizer designed for word-level segmentation. It provides accurate word boundary detection in Thai text. + +**deepcut** + +.. automodule:: pythainlp.tokenize.deepcut + :members: + + Utilizes deep learning techniques for word segmentation, achieving high accuracy and performance. + +**multi_cut** + +.. automodule:: pythainlp.tokenize.multi_cut + :members: + + An ensemble tokenizer that combines multiple tokenization strategies for improved word segmentation. + +**nlpo3** + +.. automodule:: pythainlp.tokenize.nlpo3 + :members: + + A word tokenizer based on the NLPO3 model. It offers advanced word boundary detection and is suitable for various NLP tasks. + +**longest** + +.. automodule:: pythainlp.tokenize.longest + :members: + + A tokenizer that identifies word boundaries by selecting the longest possible words in a text. + +**pyicu** + +.. automodule:: pythainlp.tokenize.pyicu + :members: + + An ICU-based word tokenizer offering robust support for Thai text segmentation. + +**nercut** + +.. 
automodule:: pythainlp.tokenize.nercut + :members: + + A dictionary-based word tokenizer that merges tokens belonging to the same named entity, so that each entity stays in one piece. It is a tokenizer, not an NER model. + +**sefr_cut** + +.. automodule:: pythainlp.tokenize.sefr_cut + :members: + + A wrapper for SEFR CUT (Stacked Ensemble Filter and Refine), a model-based word tokenizer for Thai text. + +**oskut** + +.. automodule:: pythainlp.tokenize.oskut + :members: + + A wrapper for OSKut (Out-of-domain StacKed cut), a word tokenizer that uses a pre-trained model. + +**newmm (Default)** + +.. automodule:: pythainlp.tokenize.newmm + :members: + + The default word tokenization engine. It balances accuracy and speed for most use cases. Subword level ------------- -tcc -+++ +**tcc** + .. automodule:: pythainlp.tokenize.tcc + :members: + + Tokenizes text into Thai Character Clusters (TCCs), a subword-level representation. -tcc+ -++++ +**tcc+** + .. automodule:: pythainlp.tokenize.tcc_p + :members: + + A TCC tokenizer with additional rules for more precise subword segmentation. -etcc -++++ +**etcc** + .. automodule:: pythainlp.tokenize.etcc - -han_solo -++++++++ -.. automodule:: pythainlp.tokenize.han_solo \ No newline at end of file + :members: + + Enhanced Thai Character Clusters (eTCC) tokenizer for subword-level analysis. + +**han_solo** + +.. automodule:: pythainlp.tokenize.han_solo + :members: + + Han-solo, a Thai syllable segmenter. Despite the name, it is unrelated to Han (Chinese) characters. diff --git a/docs/api/tools.rst b/docs/api/tools.rst index 03879cd0c..f852f010f 100644 --- a/docs/api/tools.rst +++ b/docs/api/tools.rst @@ -2,12 +2,29 @@ pythainlp.tools ==================================== -The :class:`pythainlp.tools` contains miscellaneous functions for PyThaiNLP internal use. +The :mod:`pythainlp.tools` module encompasses a collection of miscellaneous functions primarily designed for internal use within the PyThaiNLP library. 
While these functions may not be directly exposed for external use, understanding their purpose can offer insights into the inner workings of PyThaiNLP. Modules ------- .. autofunction:: get_full_data_path + :noindex: + + Retrieves the full path to the PyThaiNLP data directory. This function is essential for internal data management, enabling PyThaiNLP to locate resources efficiently. + .. autofunction:: get_pythainlp_data_path + :noindex: + + Obtains the path to the PyThaiNLP data directory. This function is useful for accessing the library's data resources for internal processes. + .. autofunction:: get_pythainlp_path + :noindex: + + Returns the path to the PyThaiNLP library directory. This function is vital for PyThaiNLP's internal operations and library management. + .. autofunction:: pythainlp.tools.misspell.misspell + :noindex: + + This module appears to be related to handling misspellings within PyThaiNLP. While not explicitly documented here, it likely provides functionality for identifying and correcting misspelled words, which can be crucial for text preprocessing and language processing tasks. + +The `pythainlp.tools` module contains these functions, which are mainly intended for PyThaiNLP's internal workings. While they may not be directly utilized by external users, they play a pivotal role in ensuring the smooth operation of the library. Understanding the purpose of these functions can be valuable for contributors and developers working on PyThaiNLP, as it sheds light on the internal mechanisms and data management within the library. diff --git a/docs/api/translate.rst b/docs/api/translate.rst index 4662fea59..5bb252bbd 100644 --- a/docs/api/translate.rst +++ b/docs/api/translate.rst @@ -2,16 +2,44 @@ pythainlp.translate =================== -The :class:`pythainlp.translate` for machine translation. +The :mod:`pythainlp.translate` module is dedicated to machine translation capabilities for the PyThaiNLP library. 
It provides tools for translating text between different languages, making it a valuable resource for natural language processing tasks. Modules ------- .. autoclass:: Translate :members: + + The `Translate` class is the central component of the module, offering a unified interface for various translation tasks. It acts as a coordinator, directing translation requests to specific language pairs and models. + .. autofunction:: pythainlp.translate.en_th.download_model_all + :noindex: + + This function facilitates the download of all available English to Thai translation models. It ensures that the required models are accessible for translation tasks, enhancing the usability of the module. + .. autoclass:: pythainlp.translate.en_th.EnThTranslator + :members: + + The `EnThTranslator` class specializes in translating text from English to Thai. It offers a range of methods for translating sentences and text, enabling accurate and meaningful translations between these languages. + .. autoclass:: pythainlp.translate.en_th.ThEnTranslator + :members: + + Conversely, the `ThEnTranslator` class focuses on translating text from Thai to English. It provides functionality for translating Thai text into English, contributing to effective language understanding and communication. + .. autoclass:: pythainlp.translate.zh_th.ThZhTranslator + :members: + + The `ThZhTranslator` class specializes in translating text from Thai to Chinese (Simplified). This class is valuable for bridging language gaps between these two languages, promoting cross-cultural communication. + .. autoclass:: pythainlp.translate.zh_th.ZhThTranslator + :members: + + The `ZhThTranslator` class is designed for translating text from Chinese (Simplified) to Thai. It assists in making content accessible to Thai-speaking audiences by converting Chinese text into Thai. + .. 
autoclass:: pythainlp.translate.th_fr.ThFrTranslator + :members: + + Lastly, the `ThFrTranslator` class specializes in translating text from Thai to French. It serves as a tool for expanding language accessibility and promoting content sharing in French-speaking communities. + +The `pythainlp.translate` module extends the language processing capabilities of PyThaiNLP, offering machine translation functionality for various language pairs. Whether you need to translate text between English and Thai, Thai and Chinese, or Thai and French, this module provides the necessary tools and classes to facilitate seamless language conversion. The `Translate` class acts as the central coordinator, while language-specific classes ensure accurate and meaningful translations for diverse linguistic scenarios. diff --git a/docs/api/transliterate.rst b/docs/api/transliterate.rst index ca7eeba8d..e95c9dca1 100644 --- a/docs/api/transliterate.rst +++ b/docs/api/transliterate.rst @@ -2,60 +2,67 @@ pythainlp.transliterate ==================================== -The :class:`pythainlp.transliterate` turns Thai text into a romanized one (put simply, spelled with English). +The :mod:`pythainlp.transliterate` module is dedicated to the transliteration of Thai text into romanized form, effectively spelling it out with the English alphabet. This functionality is invaluable for making Thai text more accessible to non-Thai speakers and for various language processing tasks. Modules ------- .. autofunction:: romanize + :noindex: + + The `romanize` function allows you to transliterate Thai text, converting it into a phonetic representation using the English alphabet. It's a fundamental tool for rendering Thai words and phrases in a more familiar format. + .. autofunction:: transliterate + :noindex: + + The `transliterate` function serves as a versatile transliteration tool, offering a range of transliteration engines to choose from. 
It provides flexibility and customization for your transliteration needs. + .. autofunction:: pronunciate + :noindex: + + This function provides assistance in generating phonetic representations of Thai words, which is particularly useful for language learning and pronunciation practice. + .. autofunction:: puan -.. autoclass:: pythainlp.transliterate.wunsen.WunsenTransliterate - :members: + :noindex: -Romanize Engines ----------------- -thai2rom -++++++++ -.. automodule:: pythainlp.transliterate.thai2rom.romanize -royin -+++++ -.. automodule:: pythainlp.transliterate.royin.romanize + The `puan` function offers a unique transliteration feature known as "Puan." It provides a specialized transliteration method for Thai text and is an additional option for rendering Thai text into English characters. -Transliterate Engines ---------------------- +.. autoclass:: pythainlp.transliterate.wunsen.WunsenTransliterate + :members: + + The `WunsenTransliterate` class represents a transliteration engine known as "Wunsen." It offers specific transliteration methods for rendering Thai text into a phonetic English format. -icu -+++ -.. automodule:: pythainlp.transliterate.pyicu +Transliteration Engines +----------------------- -.. autofunction:: pythainlp.transliterate.pyicu.transliterate +**thai2rom** + +.. automodule:: pythainlp.transliterate.thai2rom.romanize + :members: + + The `thai2rom` engine specializes in transliterating Thai text into romanized form. It's particularly useful for rendering Thai words accurately in an English phonetic format. -ipa -+++ -.. automodule:: pythainlp.transliterate.ipa -.. autofunction:: pythainlp.transliterate.ipa.transliterate -.. autofunction:: pythainlp.transliterate.ipa.trans_list -.. autofunction:: pythainlp.transliterate.ipa.xsampa_list +**royin** + +.. automodule:: pythainlp.transliterate.royin.romanize + :members: + + The `royin` engine focuses on transliterating Thai text into English characters. 
It provides an alternative approach to transliteration, ensuring accurate representation of Thai words. -thaig2p -+++++++ -.. automodule:: pythainlp.transliterate.thaig2p.transliterate -.. autofunction:: pythainlp.transliterate.thaig2p.transliterate +**Transliterate Engines** -tltk -++++ -.. autofunction:: pythainlp.transliterate.tltk.romanize -.. autofunction:: pythainlp.transliterate.tltk.tltk_g2p -.. autofunction:: pythainlp.transliterate.tltk.tltk_ipa +This section includes multiple transliteration engines designed to suit various use cases. They offer unique methods for transliterating Thai text into romanized form: -iso_11940 -+++++++++ -.. automodule:: pythainlp.transliterate.iso_11940 +- **icu**: Utilizes the ICU transliteration system for phonetic conversion. +- **ipa**: Provides International Phonetic Alphabet (IPA) representation of Thai text. +- **thaig2p**: Transliterates Thai text into the Grapheme-to-Phoneme (G2P) representation. +- **tltk**: Utilizes the TLTK transliteration system for a specific approach to transliteration. +- **iso_11940**: Focuses on the ISO 11940 transliteration standard. References ---------- .. [#rtgs_transcription] Nitaya Kanchanawan. (2006). `Romanization, Transliteration, and Transcription for the Globalization of the Thai Language. `_ The Journal of the Royal Institute of Thailand. + +The `pythainlp.transliterate` module offers a comprehensive set of tools and engines for transliterating Thai text into Romanized form. Whether you need a simple transliteration, specific engines for accurate representation, or phonetic rendering, this module provides a wide range of options. Additionally, the module references a publication that highlights the significance of Romanization, Transliteration, and Transcription in making the Thai language accessible to a global audience. 
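As a minimal sketch of how the romanization engines named above might be selected, the helper below romanizes a list of words through the ``engine`` argument of ``romanize``. The helper name ``romanize_all`` is hypothetical and not part of PyThaiNLP; the sketch assumes PyThaiNLP is installed and that the ``royin`` engine is available in the installed version.

```python
from typing import List


def romanize_all(words: List[str], engine: str = "royin") -> List[str]:
    """Romanize each Thai word in ``words`` with the chosen engine.

    ``romanize_all`` is a hypothetical helper for illustration only;
    it is not part of PyThaiNLP.
    """
    if not words:
        # Nothing to romanize, so return early without loading PyThaiNLP
        return []

    # Imported lazily so this sketch can be loaded without PyThaiNLP installed
    from pythainlp.transliterate import romanize

    # Pass the engine name through, e.g. "royin" or "thai2rom"
    return [romanize(word, engine=engine) for word in words]
```

With PyThaiNLP installed, ``romanize_all(["แมว", "ภาษาไทย"])`` would return one romanized string per input word. Passing ``engine="thai2rom"`` switches to the model-based engine, which may need extra dependencies.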
diff --git a/docs/api/ulmfit.rst b/docs/api/ulmfit.rst index 1f9aa002a..1c65e4b01 100644 --- a/docs/api/ulmfit.rst +++ b/docs/api/ulmfit.rst @@ -2,26 +2,89 @@ pythainlp.ulmfit ==================================== - -Universal Language Model Fine-tuning for Text Classification (ULMFiT). +Welcome to the `pythainlp.ulmfit` module, where you'll find powerful tools for Universal Language Model Fine-tuning for Text Classification (ULMFiT). ULMFiT is a cutting-edge technique for training deep learning models on large text corpora and then fine-tuning them for specific text classification tasks. Modules ------- + .. autoclass:: ThaiTokenizer + :members: + + The `ThaiTokenizer` class is a critical component of ULMFiT, designed for tokenizing Thai text effectively. Tokenization is the process of breaking down text into individual tokens, and this class allows you to do so with precision and accuracy. + .. autofunction:: document_vector + :noindex: + + The `document_vector` function is a powerful tool that computes document vectors for text data. This functionality is often used in text classification tasks where you need to represent documents as numerical vectors for machine learning models. + .. autofunction:: fix_html + :noindex: + + The `fix_html` function is a text preprocessing utility that handles HTML-specific characters, making text cleaner and more suitable for text classification. + .. autofunction:: lowercase_all + :noindex: + + The `lowercase_all` function is a text processing utility that converts all text to lowercase. This is useful for ensuring uniformity in text data and reducing the complexity of text classification tasks. + .. autofunction:: merge_wgts + :noindex: + + The `merge_wgts` function is a tool for merging weight arrays, which can be crucial for managing and fine-tuning deep learning models in ULMFiT. + .. 
autofunction:: process_thai + :noindex: + + The `process_thai` function is designed for preprocessing Thai text data, a vital step in preparing text for ULMFiT-based text classification. + .. autofunction:: rm_brackets + :noindex: + + The `rm_brackets` function removes brackets from text, making it more suitable for text classification tasks that don't require bracket information. + .. autofunction:: rm_useless_newlines + :noindex: + + The `rm_useless_newlines` function eliminates unnecessary newlines in text data, ensuring that text is more compact and easier to work with in ULMFiT-based text classification. + .. autofunction:: rm_useless_spaces + :noindex: + + The `rm_useless_spaces` function removes extraneous spaces from text, making it cleaner and more efficient for ULMFiT-based text classification. + .. autofunction:: remove_space + :noindex: + + The `remove_space` function is a utility for removing space characters from text data, streamlining the text for classification purposes. + .. autofunction:: replace_rep_after + :noindex: + + The `replace_rep_after` function is a text preprocessing tool for replacing repeated characters in text with a single occurrence. This step helps in standardizing text data for text classification. + .. autofunction:: replace_rep_nonum + :noindex: + + The `replace_rep_nonum` function is similar to `replace_rep_after`, but it focuses on replacing repeated characters without considering numbers. + .. autofunction:: replace_wrep_post + :noindex: + + The `replace_wrep_post` function is used for replacing repeated words in text with a single occurrence. This function helps in reducing redundancy in text data, making it more efficient for text classification tasks. + .. autofunction:: replace_wrep_post_nonum + :noindex: + + Similar to `replace_wrep_post`, the `replace_wrep_post_nonum` function removes repeated words without considering numbers in the text. + .. 
autofunction:: spec_add_spaces + :noindex: + + The `spec_add_spaces` function is a text processing tool for adding spaces between special characters in text data. This step helps in standardizing text for ULMFiT-based text classification. + .. autofunction:: ungroup_emoji + :noindex: + + The `ungroup_emoji` function is designed for ungrouping emojis in text data, which can be crucial for emoji recognition and classification tasks. -:members: tokenizer +The `pythainlp.ulmfit` module provides a comprehensive set of tools for ULMFiT-based text classification. Whether you need to preprocess Thai text, tokenize it, compute document vectors, or perform various text cleaning tasks, this module has the utilities you need. ULMFiT is a state-of-the-art technique in NLP, and these tools empower you to use it effectively for text classification. diff --git a/docs/api/util.rst b/docs/api/util.rst index ecb23df99..f8a9ed40d 100644 --- a/docs/api/util.rst +++ b/docs/api/util.rst @@ -2,61 +2,267 @@ pythainlp.util ===================================== -The :class:`pythainlp.util` contains utility functions, like text conversion and formatting +The :mod:`pythainlp.util` module serves as a treasure trove of utility functions designed to aid text conversion, formatting, and various language processing tasks in the context of Thai language. Modules ------- .. autofunction:: abbreviation_to_full_text + :noindex: + + The `abbreviation_to_full_text` function is a text processing tool for converting common Thai abbreviations into their full, expanded forms. It's invaluable for improving text readability and clarity. + .. autofunction:: arabic_digit_to_thai_digit + :noindex: + + The `arabic_digit_to_thai_digit` function allows you to transform Arabic numerals into their Thai numeral equivalents. This utility is especially useful when working with Thai numbers in text data. + .. 
autofunction:: bahttext + :noindex: + + The `bahttext` function specializes in converting numerical values into Thai Baht text, an essential feature for rendering financial data or monetary amounts in a user-friendly Thai format. + .. autofunction:: convert_years + :noindex: + + The `convert_years` function is designed to facilitate the conversion of Western calendar years into Thai Buddhist Era (BE) years. This is significant for presenting dates and years in a Thai context. + .. autofunction:: collate + :noindex: + + The `collate` function is a versatile tool for sorting Thai text in a locale-specific manner. It ensures that text data is sorted correctly, taking into account the Thai language's unique characteristics. + .. autofunction:: count_thai_chars + :noindex: + + The `count_thai_chars` function is a character counting tool specifically tailored for Thai text. It helps in quantifying Thai characters, which can be useful for various text processing tasks. + .. autofunction:: countthai + :noindex: + + The `countthai` function is a text processing utility for counting the occurrences of Thai characters in text data. This is useful for understanding the prevalence of Thai language content. + .. autofunction:: dict_trie + :noindex: + + The `dict_trie` function implements a Trie data structure for efficient dictionary operations. It's a valuable resource for dictionary management and fast word lookup. + .. autofunction:: digit_to_text + :noindex: + + The `digit_to_text` function is a numeral conversion tool that translates Arabic numerals into their Thai textual representations. This is vital for rendering numbers in Thai text naturally. + .. autofunction:: display_thai_char + :noindex: + + The `display_thai_char` function is designed to present Thai characters with diacritics and tonal marks accurately. This is essential for displaying Thai text with correct pronunciation cues. + .. 
autofunction:: emoji_to_thai + :noindex: + + The `emoji_to_thai` function replaces emoji in text with Thai words that describe them. + .. autofunction:: eng_to_thai + :noindex: + + The `eng_to_thai` function fixes Thai text that was typed with the keyboard set to the English layout. It maps each character to the Thai character on the same key; it does not translate or transliterate English. + .. autofunction:: find_keyword + :noindex: + + The `find_keyword` function picks keywords from a list of words by counting how often each word occurs, with stop words excluded. + .. autofunction:: ipa_to_rtgs + :noindex: + + The `ipa_to_rtgs` function converts International Phonetic Alphabet (IPA) transcriptions into Royal Thai General System of Transcription (RTGS) format. + .. autofunction:: is_native_thai + :noindex: + + The `is_native_thai` function checks whether a word is a native Thai word, as opposed to a loanword. It is not a language detector. + .. autofunction:: isthai + :noindex: + + The `isthai` function checks whether all characters in a text are Thai characters. + .. autofunction:: isthaichar + :noindex: + + The `isthaichar` function checks whether a single character belongs to the Thai script. + .. autofunction:: maiyamok + :noindex: + + The `maiyamok` function expands the Thai repetition mark mai yamok (ๆ) by repeating the word that precedes it. Mai yamok is a repetition mark, not a tone mark. + .. 
autofunction:: nectec_to_ipa + :noindex: + + The `nectec_to_ipa` function converts text from the NECTEC phonetic transcription system to the International Phonetic Alphabet (IPA). + .. autofunction:: normalize + :noindex: + + The `normalize` function cleans Thai text: it removes zero-width characters, duplicate spaces, repeated vowels and signs, and dangling marks, and reorders vowels and tone marks into their standard order. It does not strip tone marks; use `remove_tonemark` for that. + .. autofunction:: now_reign_year + :noindex: + + The `now_reign_year` function returns the current year counted from the start of the reign of the present Thai monarch. It is a reign year, not a Buddhist Era (BE) year. + .. autofunction:: num_to_thaiword + :noindex: + + The `num_to_thaiword` function converts a number into its Thai word form, for rendering numbers as natural Thai text. + .. autofunction:: rank + :noindex: + + The `rank` function counts how often each word occurs in a list of words and returns the counts, optionally excluding stop words. + .. autofunction:: reign_year_to_ad + :noindex: + + The `reign_year_to_ad` function converts a year counted in a Thai monarch's reign into a Western calendar (AD) year. + .. autofunction:: remove_dangling + :noindex: + + The `remove_dangling` function removes dangling characters or diacritics from text. It is useful for text cleaning and normalization. + .. autofunction:: remove_dup_spaces + :noindex: + + The `remove_dup_spaces` function removes duplicate space characters from text data, making it more consistent and readable. + .. 
autofunction:: remove_repeat_vowels + :noindex: + + The `remove_repeat_vowels` function is designed to eliminate repeated vowel characters in text, improving text readability and consistency. + .. autofunction:: remove_tone_ipa + :noindex: + + The `remove_tone_ipa` function serves as a phonetic conversion tool for removing tone marks from IPA transcriptions. This is crucial for phonetic analysis and linguistic research. + .. autofunction:: remove_tonemark + :noindex: + + The `remove_tonemark` function is a utility for removing tonal marks and diacritics from text data, making it suitable for various text processing tasks. + .. autofunction:: remove_zw + :noindex: + + The `remove_zw` function is designed to remove zero-width characters from text data, ensuring that text is free from invisible or unwanted characters. + .. autofunction:: reorder_vowels + :noindex: + + The `reorder_vowels` function is a text processing utility for reordering vowel characters in Thai text. It is essential for phonetic analysis and pronunciation guides. + .. autofunction:: sound_syllable + :noindex: + + The `sound_syllable` function specializes in identifying and processing Thai characters that represent sound syllables. This is valuable for phonetic and linguistic analysis. + .. autofunction:: syllable_length + :noindex: + + The `syllable_length` function is a text analysis tool for calculating the length of syllables in Thai text. It is significant for linguistic analysis and language research. + .. autofunction:: syllable_open_close_detector + :noindex: + + The `syllable_open_close_detector` function is designed to detect syllable open and close statuses in Thai text. This information is vital for phonetic analysis and linguistic research. + .. autofunction:: text_to_arabic_digit + :noindex: + + The `text_to_arabic_digit` function is a numeral conversion tool that translates Thai text numerals into Arabic numeral form. It is useful for numerical data extraction and processing. + .. 
autofunction:: text_to_num + :noindex: + + The `text_to_num` function focuses on extracting numerical values from text data. This is essential for converting textual numbers into numerical form for computation. + .. autofunction:: text_to_thai_digit + :noindex: + + The `text_to_thai_digit` function serves as a numeral conversion tool for translating Arabic numerals into Thai numeral form. This is important for rendering numbers in Thai text naturally. + .. autofunction:: thai_digit_to_arabic_digit + :noindex: + + The `thai_digit_to_arabic_digit` function allows you to transform Thai numeral text into Arabic numeral format. This is valuable for numerical data extraction and computation tasks. + .. autofunction:: thai_strftime + :noindex: + + The `thai_strftime` function is a date formatting tool tailored for Thai culture. It is essential for displaying dates and times in a format that adheres to Thai conventions. + .. autofunction:: thai_strptime + :noindex: + + The `thai_strptime` function focuses on parsing dates and times in a Thai-specific format, making it easier to work with date and time data in a Thai context. + .. autofunction:: thai_to_eng + :noindex: + + The `thai_to_eng` function is a text conversion tool for translating Thai text into its English transliterated form. This is beneficial for rendering Thai words and phrases in an English context. + .. autofunction:: thai_word_tone_detector + :noindex: + + The `thai_word_tone_detector` function specializes in detecting and processing tonal marks in Thai words. It is essential for phonetic analysis and pronunciation guides. + .. autofunction:: thaiword_to_date + :noindex: + + The `thaiword_to_date` function facilitates the conversion of Thai word representations of dates into standardized date formats. This is important for date data extraction and processing. + .. 
autofunction:: thaiword_to_num + :noindex: + + The `thaiword_to_num` function is a numeral conversion tool for translating Thai word numerals into numerical form. This is essential for numerical data extraction and computation. + .. autofunction:: thaiword_to_time + :noindex: + + The `thaiword_to_time` function is designed for converting Thai word representations of time into standardized time formats. It is crucial for time data extraction and processing. + .. autofunction:: time_to_thaiword + :noindex: + + The `time_to_thaiword` function focuses on converting time values into Thai word representations. This is valuable for rendering time in a natural Thai textual format. + .. autofunction:: tis620_to_utf8 + :noindex: + + The `tis620_to_utf8` function serves as a character encoding conversion tool for converting TIS-620 encoded text into UTF-8 format. This is significant for character encoding compatibility. + .. autofunction:: tone_detector + :noindex: + + The `tone_detector` function is a text processing tool for detecting tone marks and diacritics in Thai text. It is essential for phonetic analysis and pronunciation guides. + .. autofunction:: words_to_num + :noindex: + + The `words_to_num` function is a numeral conversion utility that translates Thai word numerals into numerical form. It is important for numerical data extraction and computation. + .. autofunction:: pythainlp.util.spell_words.spell_syllable + :noindex: + + The `pythainlp.util.spell_words.spell_syllable` function focuses on spelling syllables in Thai text, an important feature for phonetic analysis and linguistic research. + .. autofunction:: pythainlp.util.spell_words.spell_word + :noindex: + + The `pythainlp.util.spell_words.spell_word` function is designed for spelling individual words in Thai text, facilitating phonetic analysis and pronunciation guides. + .. autoclass:: Trie - :members: + :members: + + The `Trie` class is a data structure for efficient dictionary operations. 
It's a valuable resource for managing and searching word lists and dictionaries in a structured and efficient manner. diff --git a/docs/api/wangchanberta.rst b/docs/api/wangchanberta.rst index 8752538e9..7162dbfe4 100644 --- a/docs/api/wangchanberta.rst +++ b/docs/api/wangchanberta.rst @@ -2,12 +2,11 @@ pythainlp.wangchanberta ======================= +The `pythainlp.wangchanberta` module is built upon the WangchanBERTa base model, specifically the `wangchanberta-base-att-spm-uncased` model, as detailed in the paper by Lowphansirikul et al. [#Lowphansirikul_2021]_ -WangchanBERTa base model: wangchanberta-base-att-spm-uncased [#Lowphansirikul_2021]_ +This base model is used for several natural language processing tasks in the Thai language, including named entity recognition, part-of-speech tagging, and subword tokenization. -We used WangchanBERTa for Thai name tagger task, part-of-speech and subword tokenizer. - -If you want to finetune model, You can read https://github.com/vistec-AI/thai2transformers +If you intend to fine-tune the model or explore its capabilities further, please refer to the `thai2transformers repository <https://github.com/vistec-AI/thai2transformers>`_. **Speed Benchmark** @@ -19,7 +18,7 @@ pythainlp.wangchanberta (CPU) 9.64 s 9.65 s pythainlp.wangchanberta (GPU) 8.02 s 8 s ============================= ======================== ============== -Notebook: +For a comprehensive performance benchmark, the following notebooks are available: - `PyThaiNLP basic function and pythainlp.wangchanberta CPU at Google Colab`_ @@ -32,14 +31,20 @@ Modules ------- .. autoclass:: NamedEntityRecognition :members: + + The `NamedEntityRecognition` class is a fundamental component for identifying named entities in Thai text. It allows you to extract entities such as names, locations, and organizations from text data. + .. autoclass:: ThaiNameTagger :members: + + The `ThaiNameTagger` class is designed for tagging named entities in Thai text. 
This is essential for tasks such as entity recognition, information extraction, and text classification. + .. autofunction:: segment + :noindex: + + The `segment` function is a subword tokenization tool that breaks down text into subword units, offering a foundation for further text processing and analysis. References ---------- -.. [#Lowphansirikul_2021] Lowphansirikul L, Polpanumas C, Jantrakulchai N, Nutanong S. - WangchanBERTa: Pretraining transformer-based Thai Language Models. - arXiv:210109635 [cs] [Internet]. 2021 Jan 23 [cited 2021 Feb 27]; - Available from: http://arxiv.org/abs/2101.09635 +.. [#Lowphansirikul_2021] Lowphansirikul L, Polpanumas C, Jantrakulchai N, Nutanong S. WangchanBERTa: Pretraining transformer-based Thai Language Models. arXiv:2101.09635 [cs] [Internet]. 2021 Jan 23 [cited 2021 Feb 27]. Available from: http://arxiv.org/abs/2101.09635 diff --git a/docs/api/word_vector.rst b/docs/api/word_vector.rst index b9c4b2cd1..06385b0d9 100644 --- a/docs/api/word_vector.rst +++ b/docs/api/word_vector.rst @@ -1,26 +1,52 @@ .. currentmodule:: pythainlp.word_vector pythainlp.word_vector -==================================== -The :class:`word_vector` contains functions that make use of a pre-trained vector of public data. +======================= +The :class:`word_vector` contains functions that make use of pre-trained vectors from public data. +The `pythainlp.word_vector` module is a valuable resource for working with pre-trained word vectors. These word vectors are trained on large corpora and can be used for various natural language processing tasks, such as word similarity, document similarity, and more. Dependencies ------------- +======================= Installation of :mod:`numpy` and :mod:`gensim` is required. +Before using this module, you need to ensure that the `numpy` and `gensim` libraries are installed in your environment. These libraries are essential for loading and working with the pre-trained word vectors. + Modules ------- - .. 
autofunction:: doesnt_match + :noindex: + + The `doesnt_match` function is designed to identify the word that does not match a set of words in terms of semantic similarity. It is useful for odd-one-out checks on a list of words. + .. autofunction:: get_model + :noindex: + + The `get_model` function allows you to load a pre-trained word vector model, which can then be used for various word vector operations. This function serves as the entry point for accessing pre-trained word vectors. + .. autofunction:: most_similar_cosmul + :noindex: + + The `most_similar_cosmul` function finds the words that are most similar to a list of positive words and least similar to a list of negative words. Candidates are scored with the multiplicative combination objective (3CosMul) of Levy and Goldberg rather than with plain cosine similarity, which makes the function useful for word analogy tasks. + .. autofunction:: sentence_vectorizer + :noindex: + + The `sentence_vectorizer` function takes a sentence as input and returns a vector representation of the entire sentence based on word vectors. This is valuable for document similarity and text classification tasks. + .. autofunction:: similarity + :noindex: + + The `similarity` function calculates the cosine similarity between two words based on their word vectors. It helps in measuring the semantic similarity between words. + .. autoclass:: WordVector :members: + The `WordVector` class encapsulates word vector operations and functions. It provides a convenient interface for loading models, finding word similarities, and generating sentence vectors. + References ---------- -.. [#OmerLevy_YoavGoldberg_2014] Omer Levy and Yoav Goldberg (2014). - Linguistic Regularities in Sparse and Explicit Word Representations. +- `Omer Levy and Yoav Goldberg (2014). Linguistic Regularities in Sparse and Explicit Word Representations <https://www.aclweb.org/anthology/W14-1618/>`_ + This reference points to the work by Omer Levy and Yoav Goldberg, which discusses linguistic regularities in word representations. It underlines the theoretical foundation of word vectors and their applications in NLP. 
+ +This enhanced documentation provides a more detailed and organized overview of the `pythainlp.word_vector` module, making it a valuable resource for NLP practitioners and researchers working with pre-trained word vectors in the Thai language. diff --git a/docs/api/wsd.rst b/docs/api/wsd.rst index 30656b4ff..0fe563cd2 100644 --- a/docs/api/wsd.rst +++ b/docs/api/wsd.rst @@ -2,11 +2,15 @@ pythainlp.wsd ============= - -The :class:`pythainlp.wsd` contains functions used to get word senses for Thai Word Sense Disambiguation (WSD). - +The :class:`pythainlp.wsd` contains a function for getting word senses for Thai Word Sense Disambiguation (WSD). +The `pythainlp.wsd` module is designed to assist in Word Sense Disambiguation (WSD) for the Thai language. Word Sense Disambiguation is a crucial task in natural language processing that involves determining the correct sense or meaning of a word within a given context. This module provides a function for achieving precisely that. Modules ------- +.. autofunction:: get_sense + + The `get_sense` function is the primary tool within this module for performing Word Sense Disambiguation in Thai text. Given a sentence and a word in it, this function returns the candidate senses for that word, each scored against the context, so that the best-matching sense can be selected. This is particularly useful for tasks where word sense ambiguity needs to be resolved, such as text understanding and translation. + +By using the `pythainlp.wsd` module, you can enhance the accuracy of your NLP applications when dealing with Thai text, ensuring that words are interpreted in the correct context. -.. autofunction:: get_sense \ No newline at end of file +This improved documentation offers a clear and concise explanation of the purpose of the `pythainlp.wsd` module and its primary function, `get_sense`, in the context of Word Sense Disambiguation. It helps users understand the module's utility in disambiguating word senses within the Thai language, which is valuable for a wide range of NLP applications.
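As a rough illustration of the ideas behind the `similarity`, `sentence_vectorizer`, and `doesnt_match` descriptions in docs/api/word_vector.rst, the sketch below computes cosine similarity, an averaged sentence vector, and an odd-one-out choice in plain Python. It is not PyThaiNLP's implementation and calls no PyThaiNLP function: `TOY_VECTORS`, `cosine_similarity`, `average_vector`, and `odd_one_out` are hypothetical names, the three-dimensional vectors are made-up values, and averaging raw (unnormalized) vectors is a simplifying assumption.

```python
import math

# Hypothetical toy embeddings; real pre-trained models use hundreds of dimensions.
TOY_VECTORS = {
    "แมว": [0.9, 0.1, 0.0],    # cat
    "สุนัข": [0.8, 0.2, 0.1],  # dog
    "รถ": [0.1, 0.9, 0.3],     # car
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # by convention here, a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def average_vector(words, vectors):
    """Average the vectors of the known words (a simple sentence vector)."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words has a vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def odd_one_out(words, vectors):
    """Pick the word whose vector is least similar to the mean of all words."""
    mean = average_vector(words, vectors)
    return min(words, key=lambda w: cosine_similarity(vectors[w], mean))


if __name__ == "__main__":
    print(cosine_similarity(TOY_VECTORS["แมว"], TOY_VECTORS["สุนัข"]))
    print(average_vector(["แมว", "สุนัข"], TOY_VECTORS))
    print(odd_one_out(["แมว", "สุนัข", "รถ"], TOY_VECTORS))
```

With these toy values the two animal words score closer to each other than either does to the vehicle word, which is the behaviour the real functions rely on at much larger scale.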