[Suggestion] Add consonant-remover method #860

@konbraphat51

Detailed description

I suggest adding a dictionary-based method that removes repeated consonants.
For example: เริศศศศศศศศศศศศศศ -> เริศ

Context

I am doing text mining on Pantip. Quite a few people write words like "เริศศศศศศศศศศศศศศ" to express emotion. The current pythainlp.util.normalize() removes only duplicated vowels, so no existing method handles this case. Current tokenizers may split the word into "เริศ / ศศศศศศศศศศศศศ", which adds noise to the analysis.
In addition, my own implementation turned out to be fairly long, so I would like to have this method in the pythainlp library.

Possible implementation

My implementation looks like this:

    # Handles cases such as เริศศศศศศศศศศศศศศ.
    # Fragment from inside a loop. It assumes `import pythainlp`, that `sentence`
    # is the current string, that `sentences[cnt]` is its slot in the list being
    # cleaned, and that `dictio` is the word dictionary.

    if (len(sentence) > 2) and pythainlp.util.isthaichar(sentence[-1]) and (sentence[-1] == sentence[-2]):
        # The sentence ends with a repeated character (repetition typically occurs at the end).
        dup = sentence[-1]

        # Find dictionary words that legitimately end with the repeated character.
        # This is recomputed here because dictio is updated dynamically.
        repeaters = []
        for word in dictio:
            if (len(word) > 2) and (word[-1] == dup) and (word[-2] == dup):
                # Skip words made up only of the repeated character.
                if any(ch != dup for ch in word):
                    repeaters.append(word)

        # Strip the whole trailing run of dup from the sentence.
        sentence_head = sentence.rstrip(dup)

        # Check whether the stripped sentence ends with the stem of any repeater.
        found = False
        for repeater in repeaters:
            rep_head = repeater.rstrip(dup)
            repetition = len(repeater) - len(rep_head)

            if sentence_head.endswith(rep_head):
                found = True
                break

        if found:
            # Keep as many copies as the matching dictionary word has.
            sentences[cnt] = sentence_head + (dup * repetition)
        else:
            # Otherwise keep a single copy.
            sentences[cnt] = sentence_head + dup
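
For discussion, the same idea could be packaged as a standalone function. This is only a rough sketch, not an existing pythainlp API: the names `collapse_trailing_repeat` and `is_thai_char` are hypothetical, the Thai check is a simple Unicode-block test standing in for `pythainlp.util.isthaichar`, and the dictionary is assumed to be any iterable of words.

```python
from typing import Iterable


def is_thai_char(ch: str) -> bool:
    """Rough stand-in check: code point lies in the Thai block (U+0E00 to U+0E7F)."""
    return "\u0e00" <= ch <= "\u0e7f"


def collapse_trailing_repeat(text: str, dictionary: Iterable[str]) -> str:
    """Collapse a run of one repeated Thai character at the end of `text`.

    Hypothetical helper: keeps one copy by default, or more when a
    dictionary word with the same stem really ends in a doubled character.
    """
    if len(text) < 2 or not is_thai_char(text[-1]) or text[-1] != text[-2]:
        return text  # no repeated Thai character at the end

    dup = text[-1]
    head = text.rstrip(dup)          # text without the trailing run
    run_in_text = len(text) - len(head)

    keep = 1
    for word in dictionary:
        stem = word.rstrip(dup)
        run_in_word = len(word) - len(stem)
        # Only words that end in a genuine doubled character matter here.
        if run_in_word >= 2 and stem and head.endswith(stem):
            keep = max(keep, run_in_word)

    # Never emit more copies than the input actually contained.
    return head + dup * min(keep, run_in_text)


print(collapse_trailing_repeat("เริศศศศศ", {"เริศ"}))
print(collapse_trailing_repeat("สรรรรรร", {"สรร"}))
```

Unlike the fragment above, this sketch takes the longest matching run rather than the first match, and it leaves text that ends in a non-Thai character untouched.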

If this plan looks good, I can open a PR.
