Conversation

wannaphong
Member

@wannaphong wannaphong commented Feb 26, 2021

What does this change?

This PR adds named entity recognition (NER) using the WangchanBERTa model.

GitHub: https://github.com/vistec-AI/thai2transformers

How to use

from pythainlp.wangchanberta import ThaiNameTagger
# dataset_name: "thainer" or "lst20"
ner = ThaiNameTagger(dataset_name="thainer")
print(ner.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์"))
# [('ทดสอบผมมีชื่อว่า ', 'O'), ('นายวรรณพงษ์ ภัททิยไพบูลย์', 'PERSON')]
print(ner.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์", tag=True))
# 'ทดสอบผมมีชื่อว่า <PERSON>นายวรรณพงษ์ ภัททิยไพบูลย์</PERSON>'
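
To illustrate how the two output forms shown above relate, here is a minimal sketch. The helper name `to_tagged_string` is hypothetical and is not part of PyThaiNLP; it only assumes that the list form is a sequence of (text, label) pairs in which "O" marks text outside any entity.

```python
from typing import List, Tuple


def to_tagged_string(pairs: List[Tuple[str, str]]) -> str:
    """Render (text, label) pairs as inline markup; "O" spans stay bare."""
    parts = []
    for text, label in pairs:
        if label == "O":
            parts.append(text)
        else:
            parts.append(f"<{label}>{text}</{label}>")
    return "".join(parts)


# The list output quoted in the PR description.
pairs = [("ทดสอบผมมีชื่อว่า ", "O"), ("นายวรรณพงษ์ ภัททิยไพบูลย์", "PERSON")]
tagged = to_tagged_string(pairs)
print(tagged)
```

This is only a way to read the two formats side by side; the actual implementation in the PR may build the string differently.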

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

  • Passed code styles and structures
  • Passed code linting checks and unit test

@pep8speaks

pep8speaks commented Feb 26, 2021

Hello @wannaphong! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-15 15:56:14 UTC

@coveralls

coveralls commented Feb 26, 2021

Coverage Status

Coverage decreased (-0.08%) to 95.728% when pulling ff6d300 on add-ner-thai2transformers into ca10958 on dev.

Member

@cstorm125 cstorm125 left a comment

Can you fix the behavior where it adds _ in the beginning of sentence token? This is an artifact from when we train the model:

t.get_ner("ผมกินข้าวที่มหาลัยจุฬา",tag=True)
>> ▁ผมกินข้าวที่<ORGANIZATION>มหาลัย</ORGANIZATION><ORGANIZATION>จุฬา</ORGANIZATION>

I think it would be better for users if we return:

t.get_ner("ผมกินข้าวที่มหาลัยจุฬา",tag=True)
>> ผมกินข้าวที่<ORGANIZATION>มหาลัย</ORGANIZATION><ORGANIZATION>จุฬา</ORGANIZATION>

Also, I understand this is probably not best practice but the model tends to split entities like มหาลัย and จุฬา above (both labelled B-ORG by the model). Do you think it will be better if we merge B-X, B-X to X even though it is not B-X, I-X?
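
The two requests above (dropping the leading marker and merging adjacent B-X, B-X tokens) can be sketched as a post-processing step. This is an illustration under stated assumptions, not the code merged in this PR: the function name `clean_and_merge` is hypothetical, the marker is assumed to be the SentencePiece character U+2581 ("▁"), and the sample token list is made up to mirror the example sentence.

```python
from typing import List, Tuple

SPIECE_UNDERLINE = "\u2581"  # "▁", assumed to be the SentencePiece boundary marker


def clean_and_merge(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Strip the leading marker and merge adjacent tokens with the same label."""
    merged: List[Tuple[str, str]] = []
    for index, (token, tag) in enumerate(pairs):
        if index == 0:
            # Only the sentence-initial marker is removed.
            token = token.lstrip(SPIECE_UNDERLINE)
        if not token:
            continue  # skip tokens that became empty
        # "B-ORG" and "I-ORG" both become "ORG"; "O" stays "O".
        label = tag.split("-", 1)[-1]
        if merged and merged[-1][1] == label:
            merged[-1] = (merged[-1][0] + token, label)
        else:
            merged.append((token, label))
    return merged


# Hypothetical token-level output for the sentence discussed above.
tokens = [("▁ผมกินข้าวที่", "O"), ("มหาลัย", "B-ORG"), ("จุฬา", "B-ORG")]
result = clean_and_merge(tokens)
print(result)
```

Merging B-X, B-X this way trades correctness for convenience: two genuinely distinct adjacent entities of the same type would be fused into one span.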

Member

@cstorm125 cstorm125 left a comment

Should we allow lst20 and thainer as choices?

@wannaphong
Member Author

> Should we allow lst20 and thainer as choices?

Yes, either one can be selected: ThaiNameTagger(dataset_name="lst20") or ThaiNameTagger(dataset_name="thainer").

@wannaphong
Member Author

> Can you fix the behavior where it adds _ in the beginning of sentence token? [...]

Fixed.

@cstorm125
Member

> Fixed.

It still returns:

t.get_ner("โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ",tag=False)
>> [('', 'B-ORG'),
 ('โรงเรียน', 'B-ORG'),
 ('สวนกุหลาบ', 'B-ORG'),
 ('เป็น', 'O'),
 ('โรงเรียน', 'O'),
 ('ที่ดี', 'O'),
 (' ', 'O'),
 ('แต่ไม่มี', 'B-ORG'),
 ('สวนกุหลาบ', 'B-ORG')]

@p16i
Contributor

p16i commented Mar 8, 2021

Before merging this PR, perhaps, it might be useful to compare the proposed and the equivalent existing ones in PyThaiNLP.

According to โมเดลประมวลผลภาษาไทยที่ใหญ่และก้าวหน้าที่สุดในขณะนี้ ("The largest and most advanced Thai language processing model at the moment", Medium, 2020), it seems WangchanBERTa significantly outperforms CRF only on ThaiNER-NER. Please correct me if I'm wrong, but I guess CRF is likely to be much smaller (and hence faster) than WangchanBERTa.

image

IMHO, doing this comparison would allow us to

  • provide recommendations to the user on which method to use (in which situation)
  • know what aspects we should improve upon.
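
The comparison suggested above could be timed with a small harness like the following. It is a sketch only: `time_call` and `compare` are hypothetical helpers, and the two stand-in taggers replace the real CRF and WangchanBERTa calls so that the block runs without any models installed.

```python
import time
from statistics import mean
from typing import Callable, Dict


def time_call(fn: Callable[[str], object], text: str, repeat: int = 5) -> float:
    """Return the mean wall-clock seconds of fn(text) over `repeat` runs."""
    durations = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn(text)
        durations.append(time.perf_counter() - start)
    return mean(durations)


def compare(taggers: Dict[str, Callable[[str], object]], text: str) -> Dict[str, float]:
    """Time every tagger on the same input text."""
    return {name: time_call(fn, text) for name, fn in taggers.items()}


# Stand-in taggers (assumptions): swap in the real tagging calls to benchmark them.
def fast_tagger(text: str):
    return [(text, "O")]


def slow_tagger(text: str):
    time.sleep(0.01)  # simulates a heavier model
    return [(text, "O")]


results = compare({"fast": fast_tagger, "slow": slow_tagger}, "ทดสอบ")
for name, seconds in results.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

For a fair comparison, model loading should happen outside the timed call, and the same input sentences should be used for every method.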

@cstorm125
Member

cstorm125 commented Mar 8, 2021

@wannaphong The issue for WangchanBERTa has been resolved in the getting started notebook.
Can you make sure that your implementation gives the same results?

#using thainer dataset
โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ

[{'entity_group': 'ORGANIZATION', 'score': 0.782967746257782, 'word': ''},
 {'entity_group': 'ORGANIZATION',
  'score': 0.9278752207756042,
  'word': 'โรงเรียน'},
 {'entity_group': 'ORGANIZATION',
  'score': 0.9350618720054626,
  'word': 'สวนกุหลาบ'},
 {'entity_group': 'O',
  'score': 0.8276164361408779,
  'word': 'เป็นโรงเรียนที่ดี<_> แต่ไม่มีสวนกุหลาบ'}]

@cstorm125
Member

> Before merging this PR, perhaps, it might be useful to compare the proposed and the equivalent existing ones in PyThaiNLP. [...]

Agreed, a speed benchmark is a nice thing to have, although I would not mind merging without one, since users can choose which module (CRF or WangchanBERTa) to use anyway.

@wannaphong
Member Author

> @wannaphong The issue for WangchanBERTa has been resolved in the getting started notebook. Can you make sure that your implementation gives the same results? [...]

Fixed

@wannaphong wannaphong added this to the 2.3 milestone Mar 11, 2021
@cstorm125
Member

LGTM. You can add the speed benchmark @heytitle asked for later.

@wannaphong wannaphong changed the title from "[WIP] Add wangchanberta" to "Add wangchanberta" Mar 13, 2021
@wannaphong
Member Author

> Before merging this PR, perhaps, it might be useful to compare the proposed and the equivalent existing ones in PyThaiNLP. [...]

I think that could be done later.

@wannaphong
Member Author

wannaphong commented Mar 15, 2021

Speed Benchmark

| Function | Named Entity Recognition | Part of Speech |
|---|---|---|
| PyThaiNLP basic function (CRF for NER and perceptron model for POS) | 89.7 ms | 312 ms |
| pythainlp.wangchanberta (CPU) | 9.64 s | 9.65 s |
| pythainlp.wangchanberta (GPU) | 8.02 s | 8 s |

Notebook:

@cstorm125
Member

LGTM

@wannaphong wannaphong merged commit 208e063 into dev Mar 15, 2021
@wannaphong wannaphong deleted the add-ner-thai2transformers branch March 18, 2021 16:54
5 participants