-
Notifications
You must be signed in to change notification settings - Fork 814
generate unicode strings to test utf-8 handling for all non-IWSLT dataset tests. #1599
Conversation
|
These test failures are actually good. 🔥 |
|
I have a Windows desktop, so instead of using CI as a debugging mechanism I can just wrap this PR up once I'm home. I'm glad we caught these. 😄 |
|
Thanks for taking this on @erip! Just a couple of questions from my side.
|
Edit: a cursory glance at dataset_utils suggests they are all utf8. |
7049f1e to
30c0966
Compare
|
Tracked down the underlying issue with encoding here on Windows. See pytorch/pytorch#72713 for context. |
Seem like the culprit is not setting |
Indeed, one option is to read the files in binary and decode appropriately. Ideally the |
Agreed. This could (should) be handled at File Opening so that downstream readers do not have to worry about it. Thanks for picking this up @erip. Looking forward to the resolution at torchdata level. Meanwhile (since I am not sure if it will take longer to resolve APIs etc for FileOpener), can we try closing this PR by reading in binary mode and setting |
|
Yes, that seems reasonable. |
Great! Then let's try to land this before we make the branch-cut. cc: @Nayef211 |
…DO: replace with FileOpener with appropriate encoding when this lands in upstream pytorch.
|
OK, I think this should be gtg now @parmeet |
parmeet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @erip! This looks good to me. I will merge it once the CI is green for unit-testing.
|
Oh one more change @erip: I know IWSLT 16/17 test suit update is pending, but can we at-least update the reading/decoding part for actual datasets text/torchtext/datasets/iwslt2016.py Lines 334 to 338 in eb61b3f
|
|
Good catch, I'll fix that. |
|
It looks like there's a lingering issue that I'm trying to debug with IMDB. The cache is written as text, but when I try to change it to being written as bytes and taking appropriate encoding compensation before the cache is written... cache_decompressed_dp = (
cache_decompressed_dp.lines_to_paragraphs()
) # group by label in cache file
+ cache_decompressed_dp = cache_decompressed_dp.map(lambda x: (x[0], x[1].encode()))
cache_decompressed_dp = cache_decompressed_dp.end_caching(
- mode="wt",
- filepath_fn=lambda x: os.path.join(root, decompressed_folder, split, x),
+ mode="wb",
+ filepath_fn=lambda x: os.path.join(root, decompressed_folder, split, x)
)I'm met with the following errors: |
|
The test seems OK if I pass |
It seems that |
@erip could you please push this change, so that we can close this PR in prep for branch-cut? Thanks |
|
Done! |
Partially address lingering TODO in #1493.
TODO: add IWSLT which fails due to XML parsing errors -- need to investigate.