Add wangchanberta #540

wannaphong · 2021-02-26T09:45:02Z

What does this change

This adds NER from the WangchanBERTa model.

GitHub: https://github.com/vistec-AI/thai2transformers

How to use

from pythainlp.wangchanberta import ThaiNameTagger

# dataset_name: "thainer" or "lst20"
ner = ThaiNameTagger(dataset_name="thainer")

print(ner.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์"))
# [('ทดสอบผมมีชื่อว่า ', 'O'), ('นายวรรณพงษ์ ภัททิยไพบูลย์', 'PERSON')]

print(ner.get_ner("ทดสอบผมมีชื่อว่า นายวรรณพงษ์ ภัททิยไพบูลย์", tag=True))
# 'ทดสอบผมมีชื่อว่า <PERSON>นายวรรณพงษ์ ภัททิยไพบูลย์</PERSON>'
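For readers wondering how the two output shapes relate, here is a minimal sketch of turning the list of (text, label) pairs into the tagged string. The helper name `render_tags` is hypothetical and is not part of PyThaiNLP; it only assumes that "O" marks text outside any entity, as in the example output.

```python
def render_tags(pairs):
    """Wrap every non-"O" segment in <LABEL>...</LABEL> and join the segments."""
    parts = []
    for text, label in pairs:
        if label == "O":
            # Text outside any entity is emitted unchanged.
            parts.append(text)
        else:
            parts.append(f"<{label}>{text}</{label}>")
    return "".join(parts)


pairs = [("ทดสอบผมมีชื่อว่า ", "O"), ("นายวรรณพงษ์ ภัททิยไพบูลย์", "PERSON")]
print(render_tags(pairs))
```

With the pairs from the first example, this produces the same string as the `tag=True` example.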

Your checklist for this pull request

🚨Please review the guidelines for contributing to this repository.

Passed code styles and structures
Passed code linting checks and unit test

pep8speaks · 2021-02-26T09:45:07Z

Hello @wannaphong! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-15 15:56:14 UTC

coveralls · 2021-02-26T09:53:34Z

Coverage decreased (-0.08%) to 95.728% when pulling ff6d300 on add-ner-thai2transformers into ca10958 on dev.

wannaphong · 2021-02-26T15:55:00Z

Google colab notebook for test: https://colab.research.google.com/drive/1VfM7161u5ExKD6vFMDmxFTZf19EnrhgI?usp=sharing

cstorm125

Can you fix the behavior where it adds ▁ at the beginning of the sentence token? This is an artifact from when we trained the model:

t.get_ner("ผมกินข้าวที่มหาลัยจุฬา",tag=True)
>> ▁ผมกินข้าวที่<ORGANIZATION>มหาลัย</ORGANIZATION><ORGANIZATION>จุฬา</ORGANIZATION>

I think it would be better for users if we return:

t.get_ner("ผมกินข้าวที่มหาลัยจุฬา",tag=True)
>> ผมกินข้าวที่<ORGANIZATION>มหาลัย</ORGANIZATION><ORGANIZATION>จุฬา</ORGANIZATION>

Also, I understand this is probably not best practice, but the model tends to split entities like มหาลัย and จุฬา above (both labelled B-ORG by the model). Do you think it would be better to merge B-X, B-X into a single X, even though the sequence is not B-X, I-X?
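A minimal sketch of the two post-processing steps discussed here: stripping the leading "▁" and merging neighbouring segments of the same entity type regardless of the B-/I- prefix. The function name `clean_and_merge` and the (text, IOB tag) input shape are assumptions for illustration, not the code in this PR.

```python
SPIECE_UNDERLINE = "\u2581"  # "▁", the SentencePiece word-boundary marker


def clean_and_merge(pairs):
    """Strip the leading marker and merge adjacent segments of the same type.

    `pairs` is a list of (text, tag) tuples where tag is "O" or an IOB tag
    such as "B-ORG" / "I-ORG". The B-/I- prefix is dropped, so "B-X, B-X"
    is merged exactly like "B-X, I-X".
    """
    merged = []
    for index, (text, tag) in enumerate(pairs):
        if index == 0:
            # Only the sentence-initial marker is removed.
            text = text.lstrip(SPIECE_UNDERLINE)
        if not text:
            continue  # skip segments that are empty after cleaning
        label = tag.split("-", 1)[-1]  # "B-ORG" -> "ORG", "O" -> "O"
        if merged and merged[-1][1] == label:
            merged[-1] = (merged[-1][0] + text, label)
        else:
            merged.append((text, label))
    return merged


example = [("▁ผมกินข้าวที่", "O"), ("มหาลัย", "B-ORG"), ("จุฬา", "B-ORG")]
print(clean_and_merge(example))
```

The trade-off is that two genuinely separate entities of the same type that happen to be adjacent would also be merged, which is why this departs from strict IOB decoding.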

cstorm125

Should we allow lst20 and thainer as choices?

wannaphong · 2021-03-06T08:46:32Z

Should we allow lst20 and thainer as choices?

Yes, you can choose either lst20 or thainer: ThaiNameTagger(dataset_name="lst20") or ThaiNameTagger(dataset_name="thainer").

wannaphong · 2021-03-06T08:49:58Z

Fixed.

cstorm125 · 2021-03-07T09:32:29Z

It still returns:

t.get_ner("โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ",tag=False)
>> [('', 'B-ORG'),
 ('โรงเรียน', 'B-ORG'),
 ('สวนกุหลาบ', 'B-ORG'),
 ('เป็น', 'O'),
 ('โรงเรียน', 'O'),
 ('ที่ดี', 'O'),
 (' ', 'O'),
 ('แต่ไม่มี', 'B-ORG'),
 ('สวนกุหลาบ', 'B-ORG')]

p16i · 2021-03-08T11:18:39Z

Before merging this PR, it might be useful to compare the proposed model with the equivalent existing ones in PyThaiNLP.

According to "โมเดลประมวลผลภาษาไทยที่ใหญ่และก้าวหน้าที่สุดในขณะนี้" ("The largest and most advanced Thai language processing model to date"; Medium, 2020), it seems WangchanBERTa significantly outperforms CRF only on ThaiNER-NER. Please correct me if I'm wrong, but I guess CRF is likely to be much smaller (and hence faster) than WangchanBERTa.

IMHO, doing this comparison would allow us to

* provide recommendations to users on which method to use (and in which situation)
* know what aspects we should improve upon.

cstorm125 · 2021-03-08T12:46:23Z

@wannaphong The issue for WangchanBERTa has been resolved in the getting started notebook.
Can you make sure that your implementation gives the same results?

#using thainer dataset
โรงเรียนสวนกุหลาบเป็นโรงเรียนที่ดี แต่ไม่มีสวนกุหลาบ

[{'entity_group': 'ORGANIZATION', 'score': 0.782967746257782, 'word': ''},
 {'entity_group': 'ORGANIZATION',
  'score': 0.9278752207756042,
  'word': 'โรงเรียน'},
 {'entity_group': 'ORGANIZATION',
  'score': 0.9350618720054626,
  'word': 'สวนกุหลาบ'},
 {'entity_group': 'O',
  'score': 0.8276164361408779,
  'word': 'เป็นโรงเรียนที่ดี<_> แต่ไม่มีสวนกุหลาบ'}]
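A minimal sketch of how grouped output like the above could be tidied before it is returned to users. It assumes, based only on the sample output, that "<_>" is a placeholder sitting where the input had a space and that entries with an empty 'word' carry no text; the helper name `tidy_entities` is hypothetical.

```python
import re

# Assumption: "<_>" stands in for a space in the model's vocabulary.
SPACE_RUN = re.compile(r"(?:<_>|\s)+")


def tidy_entities(entities):
    """Drop entries with an empty word and collapse "<_>"/whitespace runs."""
    tidied = []
    for entity in entities:
        word = SPACE_RUN.sub(" ", entity["word"])
        if not word.strip():
            continue  # nothing left to show for this entry
        tidied.append({**entity, "word": word})
    return tidied


grouped = [
    {"entity_group": "ORGANIZATION", "score": 0.782967746257782, "word": ""},
    {"entity_group": "ORGANIZATION", "score": 0.9278752207756042, "word": "โรงเรียน"},
    {"entity_group": "ORGANIZATION", "score": 0.9350618720054626, "word": "สวนกุหลาบ"},
    {"entity_group": "O", "score": 0.8276164361408779,
     "word": "เป็นโรงเรียนที่ดี<_> แต่ไม่มีสวนกุหลาบ"},
]
for entity in tidy_entities(grouped):
    print(entity["entity_group"], repr(entity["word"]))
```

The scores are passed through untouched; only the 'word' field is rewritten.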

cstorm125 · 2021-03-08T12:47:31Z

Agreed, a speed benchmark is nice to have, although I would not mind merging without one, since users will be able to choose which module (CRF or WangchanBERTa) to use anyway.

wannaphong · 2021-03-09T14:11:43Z

Fixed

cstorm125 · 2021-03-11T16:10:52Z

LGTM. You can add the speed benchmark @heytitle asked for later.

wannaphong · 2021-03-13T15:22:12Z

I think that could be done later.

wannaphong · 2021-03-15T14:40:19Z

Speed Benchmark

Function	Named Entity Recognition	Part of Speech
PyThaiNLP basic function (CRF for NER and perceptron model for POS)	89.7 ms	312 ms
pythainlp.wangchanberta (CPU)	9.64 s	9.65 s
pythainlp.wangchanberta (GPU)	8.02 s	8 s
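The table does not state the input text or the number of runs, so the following is only a minimal sketch of how such timings could be collected and compared. `mean_runtime` and `slowdown` are hypothetical helpers; the two figures passed to `slowdown` are the NER timings from the table (89.7 ms and 9.64 s).

```python
import time


def mean_runtime(func, *args, repeat=5):
    """Return the mean wall-clock seconds per call over `repeat` calls."""
    start = time.perf_counter()
    for _ in range(repeat):
        func(*args)
    return (time.perf_counter() - start) / repeat


def slowdown(baseline_seconds, candidate_seconds):
    """How many times slower the candidate is than the baseline."""
    return candidate_seconds / baseline_seconds


# NER figures from the table: CRF baseline vs. pythainlp.wangchanberta on CPU.
print(f"{slowdown(0.0897, 9.64):.1f}x slower")

# Timing any callable, here a trivial stand-in for a tagger.
print(mean_runtime(sorted, [3, 1, 2]) >= 0.0)
```

In a real comparison the model should be loaded once outside the timed call, so that load time is not counted as inference time.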

Notebook:

* PyThaiNLP basic function and pythainlp.wangchanberta CPU at Google Colab
* pythainlp.wangchanberta GPU

cstorm125 · 2021-03-15T14:48:00Z

LGTM

Add wangchanberta

43d2e55

Update wangchanberta.py

8dd0f21

wannaphong added 12 commits February 26, 2021 17:02

fixed IOB

fc20b4b

Update wangchanberta.py

7837725

fixed grouped_entities

Update wangchanberta.py

f1548f6

Add wangchanberta.PosTagTransformers

ca551d2

Move file to pythainlp.wangchanberta

266d8f6

Update wangchanberta requirements

df20fd2

Update postag.py

c94f241

Update core.py

e4f7ba1

Update core.py

79f0b83

Update core.py

ca81865

Add test

c4c1c4c

Update test

a57bd4a

wannaphong added 8 commits February 26, 2021 22:55

Update test_wangchanberta.py

68bf050

Add pythainlp.wangchanberta docs

757f9ac

Update tokenize.rst

628cf50

Fixed PEP8

8abe1b4

Update test_wangchanberta.py

8d1cbdb

Update tests

4827c7d

Fixed PEP8

1e4a0d4

Fixed PEP8

8de00f7

cstorm125 requested changes Mar 5, 2021

cstorm125 reviewed Mar 5, 2021

wannaphong added 2 commits March 6, 2021 15:48

Update core.py

f1b0a0e

Update core.py

f8a0efa

Update core

f5ae3ad

wannaphong added 4 commits March 11, 2021 15:06

Fixed PEP8

00b2753

Update core.py

c637c8a

Update core.py

f8d438a

Update core.py

87b6119

wannaphong added this to the 2.3 milestone Mar 11, 2021

Update wangchanberta.rst

9e04a18

wannaphong mentioned this pull request Mar 11, 2021

Add wangchanberta notebook PyThaiNLP/tutorials#20

Merged

wannaphong changed the title ~~[WIP] Add wangchanberta~~ Add wangchanberta Mar 13, 2021

Update pos_tag docs

34a034a

cstorm125 approved these changes Mar 13, 2021

Update pos_tag.py

5bbbebe

wannaphong added 2 commits March 15, 2021 22:12

Add pythainlp.wangchanberta Speed Benchmark

e4cf8da

Update docs

ff6d300

wannaphong merged commit 208e063 into dev Mar 15, 2021

wannaphong deleted the add-ner-thai2transformers branch March 18, 2021 16:54

wannaphong mentioned this pull request Mar 25, 2021

PyThaiNLP 2.3 change log #445

Closed
