Fix MetaSound + Adjust tokenizer selector + More documentation + clean code #135
Merged
16 commits
eee54c1 Sorting tokenizers (bact)
7ef0098 Merge pull request #3 from PyThaiNLP/dev (bact)
060dae2 Update doc (bact)
370a207 - consistent indentation (bact)
2c5fbd4 Merge pull request #4 from PyThaiNLP/dev (bact)
101cdc8 update doc (bact)
ff654d1 - Fix tokenizer selector (bact)
02748f9 delete mkdocs.yml (bact)
d20e8c4 Merge pull request #5 from PyThaiNLP/dev (bact)
4f2dd0a revert MetaSound for now (bact)
ad1f8f9 remove unused imports (bact)
3c1230a Merge branch 'dev' of https://github.com/bact/pythainlp into dev (bact)
94ae5be Fix tokenizer selector (bact)
a646f5c Fix metasound (bact)
995b0ea trying to rename MetaSound.py to metasound.py (step 1 - temporary) (bact)
fb229b2 rename MetaSound.py to metasound.py (step 2 - finish) (bact)
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
# -*- coding: utf-8 -*- | ||
""" | ||
MetaSound - Thai soundex system | ||
|
||
References: | ||
Snae & Brückner. (2009). Novel Phonetic Name Matching Algorithm with a Statistical | ||
Ontology for Analysing Names Given in Accordance with Thai Astrology. | ||
https://pdfs.semanticscholar.org/3983/963e87ddc6dfdbb291099aa3927a0e3e4ea6.pdf | ||
""" | ||
|
||
_CONS_THANTHAKHAT = "กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ์" | ||
_THANTHAKHAT = "์" # \u0e4c | ||
_C1 = "กขฃคฆฅ" # sound K -> coded letter 1 | ||
_C2 = "จฉชฌซฐทฒดฎตสศษ" # D -> 2 | ||
_C3 = "ฟฝพผภบป" # B -> 3 | ||
_C4 = "ง" # NG -> 4 | ||
_C5 = "ลฬรนณฦญ" # N -> 5 | ||
_C6 = "ม" # M -> 6 | ||
_C7 = "ย" # Y -> 7 | ||
_C8 = "ว" # W -> 8 | ||
|
||
|
||
def metasound(text, length=4): | ||
    """ | ||
    Thai MetaSound | ||
|
||
    :param str text: Thai text | ||
    :param int length: preferred length of the MetaSound (default is 4) | ||
    :return: MetaSound for the text | ||
    **Example**:: | ||
        from pythainlp.metasound import metasound | ||
        metasound("ลัก") # 'ล100' | ||
        metasound("รัก") # 'ร100' | ||
        metasound("รักษ์") # 'ร100' | ||
        metasound("บูรณการ", 5) # 'บ5515' | ||
    """ | ||
    # keep only consonants and thanthakhat | ||
    chars = [] | ||
    for ch in text: | ||
        if ch in _CONS_THANTHAKHAT: | ||
            chars.append(ch) | ||
|
||
    # remove karan (thanthakhat and a consonant before it) | ||
    i = 0 | ||
    while i < len(chars): | ||
        if chars[i] == _THANTHAKHAT: | ||
            if i > 0: | ||
                chars[i - 1] = " " | ||
            chars[i] = " " | ||
        i += 1 | ||
|
||
    # retain first consonant, encode the rest | ||
    chars = chars[:length] | ||
    i = 1 | ||
    while i < len(chars): | ||
        if chars[i] in _C1: | ||
            chars[i] = "1" | ||
        elif chars[i] in _C2: | ||
            chars[i] = "2" | ||
        elif chars[i] in _C3: | ||
            chars[i] = "3" | ||
        elif chars[i] in _C4: | ||
            chars[i] = "4" | ||
        elif chars[i] in _C5: | ||
            chars[i] = "5" | ||
        elif chars[i] in _C6: | ||
            chars[i] = "6" | ||
        elif chars[i] in _C7: | ||
            chars[i] = "7" | ||
        elif chars[i] in _C8: | ||
            chars[i] = "8" | ||
        else: | ||
            chars[i] = "0" | ||
        i += 1 | ||
|
||
    while len(chars) < length: | ||
        chars.append("0") | ||
|
||
    return "".join(chars) | ||
|
||
|
||
if __name__ == "__main__": | ||
    print(metasound("บูรณะ")) # บ550 (an example from the original paper [Figure 4]) | ||
    print(metasound("บูรณการ", 5)) # บ5515 | ||
    print(metasound("ลักษณะ")) # ล125 | ||
    print(metasound("ลัก")) # ล100 | ||
    print(metasound("รัก")) # ร100 | ||
    print(metasound("รักษ์")) # ร100 | ||
    print(metasound("")) # 0000 | ||
|
||
    print(metasound("คน")) # ค500 | ||
    print(metasound("คนA")) # ค500 (non-Thai characters are dropped) | ||
    print(metasound("ดา")) # ด000 | ||
    print(metasound("ปา")) # ป000 | ||
    print(metasound("งา")) # ง000 | ||
    print(metasound("ลา")) # ล000 | ||
    print(metasound("มา")) # ม000 | ||
    print(metasound("วา")) # ว000 |
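
The encoding step in the diff above is an if/elif chain over eight consonant groups. As a reading aid, here is a minimal table-driven sketch that should produce the same output, assuming the same group constants. `metasound_sketch`, `_GROUPS` and `_CODE` are illustrative names that do not appear in this PR, and this is one possible restatement, not the merged implementation.

```python
# -*- coding: utf-8 -*-
# Table-driven restatement of the MetaSound encoding shown in the diff.
# metasound_sketch, _GROUPS and _CODE are hypothetical names for illustration.

_CONS_THANTHAKHAT = "กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ์"
_THANTHAKHAT = "\u0e4c"

# code digit -> consonants sharing that sound group (copied from the diff)
_GROUPS = {
    "1": "กขฃคฆฅ",
    "2": "จฉชฌซฐทฒดฎตสศษ",
    "3": "ฟฝพผภบป",
    "4": "ง",
    "5": "ลฬรนณฦญ",
    "6": "ม",
    "7": "ย",
    "8": "ว",
}
# invert to a per-character lookup: consonant -> code digit
_CODE = {ch: code for code, group in _GROUPS.items() for ch in group}


def metasound_sketch(text, length=4):
    # keep only consonants and thanthakhat
    chars = [ch for ch in text if ch in _CONS_THANTHAKHAT]

    # blank out karan (thanthakhat and the consonant before it);
    # as in the diff, the blanks keep their positions and later encode as "0"
    for i, ch in enumerate(chars):
        if ch == _THANTHAKHAT:
            if i > 0:
                chars[i - 1] = " "
            chars[i] = " "

    # retain the first consonant, encode the rest, pad with "0"
    chars = chars[:length]
    encoded = chars[:1] + [_CODE.get(ch, "0") for ch in chars[1:]]
    return "".join(encoded).ljust(length, "0")


if __name__ == "__main__":
    print(metasound_sketch("ลักษณะ"))      # ล125
    print(metasound_sketch("บูรณการ", 5))  # บ5515
```

Because the first consonant is kept verbatim, "ลัก" and "รัก" encode to different keys (ล100 and ร100) even though ล and ร share code 5 in later positions.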
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,5 @@ | ||
# -*- coding: utf-8 -*- | ||
|
||
import sys | ||
|
||
try: | ||
    import icu | ||
except ImportError: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,7 +2,6 @@ | |
""" | ||
Wrapper for deepcut Thai word segmentation | ||
""" | ||
import sys | ||
|
||
try: | ||
    import deepcut | ||
|
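
Both wrapper hunks guard a third-party import with `try` / `except ImportError`, and the body of the `except` branch is cut off in this view. The following is only a generic sketch of that optional-dependency pattern, not the code in this PR. `load_optional` and its `hint` argument are hypothetical names.

```python
import importlib


def load_optional(module_name, hint):
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an install hint instead of exiting the interpreter,
        # so callers can decide how to handle the missing package.
        raise ImportError(
            "{} is not installed; try: {}".format(module_name, hint)
        ) from exc


if __name__ == "__main__":
    # json ships with Python, so this import always succeeds
    json = load_optional("json", "n/a")
    print(json.dumps({"ok": True}))
```

Raising keeps the failure catchable by the caller, and a helper like this does not need `sys` at all.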