PR-5.0-dev Pending

Hackwar
28 Aug 2023

Pull Request for Issue #40543.

Summary of Changes

Until now, the parsing and processing code used functions that are not UTF-8 safe, such as substr() and strrpos(). These operate on bytes rather than characters, which broke the parsing in some cases and caused the behavior described in #40543.
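
To illustrate the class of bug, here is a minimal sketch in Python (not the com_finder PHP code). It shows how a byte-oriented cut can split a multibyte UTF-8 character, and how a byte offset applied to a character string lands in the wrong place. This is roughly the difference between PHP's substr()/strrpos() and mb_substr()/mb_strrpos(); the sample text and the helper names `byte_prefix` and `char_prefix` are assumptions made for the example.

```python
text = "Größe über alles"
raw = text.encode("utf-8")  # 'ö', 'ß' and 'ü' take two bytes each


def byte_prefix(data: bytes, n: int) -> str:
    """Cut after n bytes, the way a byte-oriented substr would."""
    # errors="replace" makes a cut-off character visible as U+FFFD
    return data[:n].decode("utf-8", errors="replace")


def char_prefix(s: str, n: int) -> str:
    """Cut after n characters, the way a multibyte-aware substr would."""
    return s[:n]


# The byte cut lands in the middle of 'ö' and damages it.
print(repr(byte_prefix(raw, 3)))   # 'Gr\ufffd'
print(repr(char_prefix(text, 3)))  # 'Grö'

# The last space sits at byte offset 13 but character offset 10.
byte_pos = raw.rfind(b" ")
char_pos = text.rfind(" ")
print(byte_pos, char_pos)          # 13 10

# Mixing the two units cuts the text in the wrong place.
print(repr(text[:byte_pos]))       # 'Größe über al'
print(repr(text[:char_pos]))       # 'Größe über'
```

The fix is to use one unit consistently: either compute and apply all offsets in bytes, or compute and apply all of them in characters.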

Testing Instructions

Copy the text from #40543 into an article and replace the p tags between the paragraphs with br tags. Save the article. In the #__finder_links entry for that content item, the description column contains the content, and some words have been pushed together because the space between them is missing.

Actual result BEFORE applying this Pull Request

No space between some words.

Expected result AFTER applying this Pull Request

All words are separated by one space.

Link to documentations

Please select:

  • Documentation link for docs.joomla.org:

  • No documentation changes for docs.joomla.org needed

  • Pull Request link for manual.joomla.org:

  • No documentation changes for manual.joomla.org needed

joomla-cms-bot - change - 28 Aug 2023
Category: Administration com_finder
Hackwar - open - 28 Aug 2023
Hackwar - change - 28 Aug 2023
Status: New ⇒ Pending
Hackwar - change - 30 Aug 2023
Labels added: PR-5.0-dev
HLeithner - comment - 30 Aug 2023

We read 2048 bytes per fread() call. Couldn't this break UTF-8 encoded characters?

Hackwar - comment - 30 Aug 2023

If that is the case (I'm not saying that it is), we already have that problem, and this PR wouldn't change the situation. But I don't think it is a problem, because the code looks for the last space in the block it has read and loads any missing bytes after that. More interesting, in my opinion, would be to raise the amount of data read at a time. Right now it's 2 KB, but I think we should raise that to at least 8 KB, more likely something around 20 KB...
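
The "cut at the last space and carry the rest over" idea can be sketched as follows. This is a Python illustration under stated assumptions, not the actual com_finder parser: the function name `iter_space_aligned_chunks` and the sample text are hypothetical. It relies on a property of UTF-8: every byte of a multibyte sequence is 0x80 or higher, so a 0x20 byte is always a real space and never part of another character.

```python
import io
from typing import BinaryIO, Iterator


def iter_space_aligned_chunks(stream: BinaryIO, size: int = 2048) -> Iterator[str]:
    """Read fixed-size byte blocks, but only emit text up to the last space.

    Bytes after the last space are carried over and prepended to the next
    block, so a multibyte character split by the read boundary is completed
    before it is decoded.
    """
    carry = b""
    while True:
        block = stream.read(size)
        if not block:
            break
        data = carry + block
        cut = data.rfind(b" ")
        if cut == -1:
            # No space in the buffer yet: keep accumulating.
            carry = data
            continue
        # Emit everything up to and including the last space.
        yield data[: cut + 1].decode("utf-8")
        carry = data[cut + 1 :]
    if carry:
        # Whatever is left after the final space.
        yield carry.decode("utf-8")


# A tiny block size forces read boundaries inside multibyte characters.
sample = "Größe über alles, naïve café"
chunks = list(iter_space_aligned_chunks(io.BytesIO(sample.encode("utf-8")), size=4))
print(chunks)
```

The strict `decode("utf-8")` calls would raise if a piece ever ended in the middle of a character, so the sketch also serves as a check that space-aligned cuts are safe regardless of the block size. A larger block size only reduces the number of reads; it does not affect correctness.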

HLeithner - comment - 31 Aug 2023

Maybe you are right; hitting a character code 20 (hex, i.e. a space) inside a UTF-8 string at that position might be unlikely.
Increasing the length could help too. I don't know why the limit is only 2 KB; I would have expected 4 KB.

Hackwar - comment - 31 Aug 2023

I will create a separate PR to increase that parsing limit. I would use 8 KB for the time being.

HLeithner - comment - 2 Sep 2023

Thanks.