zeripath
Contributor

Our character detection algorithm can incorrectly detect UTF-8 as ISO-8859-x
if a partially read file ends with a truncated character.

This PR changes the detection algorithm to tolerate truncated UTF-8 characters at the end of the
buffer, so such content is still detected as UTF-8.

Fix #19743

Signed-off-by: Andrew Thornton [email protected]

@zeripath
Contributor Author

zeripath commented May 21, 2022

Strangely, I'm having some difficulty creating a test that replicates this issue from within the charset module.

I'm not sure what is preventing me from replicating it.


I've been able to add a test case.

@GiteaBot added the lgtm/need 2 label ("This PR needs two approvals by maintainers to be considered for merging.") on May 21, 2022
@wxiaoguang
Contributor

See my comment in #19743; there is a test case there.

@GiteaBot added the lgtm/need 1 label ("This PR needs approval from one additional maintainer to be merged.") and removed the lgtm/need 2 label ("This PR needs two approvals by maintainers to be considered for merging.") on May 21, 2022
@codecov-commenter

codecov-commenter commented May 21, 2022

Codecov Report

❗ No coverage uploaded for pull request base (main@876cad0).
The diff coverage is 85.00%.

❗ Current head 72a0a45 differs from pull request most recent head 928f95d. Consider uploading reports for the commit 928f95d to get more accurate results

@@           Coverage Diff           @@
##             main   #19773   +/-   ##
=======================================
  Coverage        ?   47.29%           
=======================================
  Files           ?      957           
  Lines           ?   133317           
  Branches        ?        0           
=======================================
  Hits            ?    63058           
  Misses          ?    62599           
  Partials        ?     7660           

Impacted Files               Coverage Δ
modules/charset/charset.go   71.73% <85.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@lunny
Member

lunny commented May 21, 2022

#19743

It's better to include this test case.

@zeripath
Contributor Author

> #19743
>
> It's better to include this test case.

I've already included a specific test case, but I can add that one if you really want it.

@GiteaBot added the lgtm/done label ("This PR has enough approvals to get merged. There are no important open reservations anymore.") and removed the lgtm/need 1 label ("This PR needs approval from one additional maintainer to be merged.") on May 21, 2022
@zeripath zeripath merged commit bc4764f into go-gitea:main May 21, 2022
@zeripath zeripath deleted the fix-19743-improve-encoding-detection branch May 21, 2022 13:06
zeripath added a commit to zeripath/gitea that referenced this pull request May 21, 2022
…esenting utf-8 (go-gitea#19773)

Backport go-gitea#19773

lunny pushed a commit that referenced this pull request May 21, 2022
…esenting utf-8 (#19773) (#19774)

Backport #19773

zjjhot added a commit to zjjhot/gitea that referenced this pull request May 21, 2022
* giteaofficial/main:
  Prevent NPE when cache service is disabled (go-gitea#19703)
  Detect truncated utf-8 characters at the end of content as still representing utf-8 (go-gitea#19773)
  Add silentcodeg to MAINTAINERS (go-gitea#19771)
  Allows repo search to match against "owner/repo" pattern strings (go-gitea#19754)
  Update JS dependencies (go-gitea#19767)
  Nuke the incorrect permission report on /api/v1/notifications (go-gitea#19761)
zeripath added a commit to zeripath/gitea that referenced this pull request Jun 20, 2022
## [1.16.9](https://github.com/go-gitea/gitea/releases/tag/1.16.9) - 2022-06-20

* BUGFIXES
  * Fix permission check for delete tag (go-gitea#19985) (go-gitea#20001)
  * Only log non ErrNotExist errors in git.GetNote (go-gitea#19884) (go-gitea#19905)
  * Use exact search instead of fuzzy search for branch filter dropdown (go-gitea#19885) (go-gitea#19893)
  * Set Setpgid on child git processes (go-gitea#19865) (go-gitea#19881)
  * Import git from alpine 3.16 repository as 2.30.4 is needed for `safe.directory = '*'` to work but alpine 3.13 has 2.30.3 (go-gitea#19876)
  * Ensure responses are context.ResponseWriters (go-gitea#19843) (go-gitea#19859)
  * Fix count bug (go-gitea#19850)
  * Fix raw endpoint PDF file headers (go-gitea#19825) (go-gitea#19826)
  * Make WIP prefixes case insensitive, e.g. allow `Draft` as a WIP prefix (go-gitea#19780) (go-gitea#19811)
  * Fix NotificationUnreadCount (go-gitea#19802)
  * Prevent NPE when cache service is disabled (go-gitea#19703) (go-gitea#19783)
  * Detect truncated utf-8 characters at the end of content as still representing utf-8 (go-gitea#19773) (go-gitea#19774)
  * Fix doctor pq: syntax error at or near "." quote user table name (go-gitea#19765) (go-gitea#19770)
  * Fix bug (go-gitea#19757)

Signed-off-by: Andrew Thornton <[email protected]>
@zeripath zeripath mentioned this pull request Jun 20, 2022
AbdulrhmnGhanem pushed a commit to kitspace/gitea that referenced this pull request Aug 24, 2022
…esenting utf-8 (go-gitea#19773)

@wxiaoguang wxiaoguang mentioned this pull request Sep 24, 2022
1 task
@go-gitea go-gitea locked and limited conversation to collaborators May 3, 2023
Labels
lgtm/done (This PR has enough approvals to get merged. There are no important open reservations anymore.)
type/bug
Development

Successfully merging this pull request may close these issues.

Wrong display of cyrillic symbols in UTF-8 file
5 participants