Joomla! Issue Tracker | Joomla! CMS #41502 - [5.0] Smart Search: Use UTF8-aware functions when indexing

PR-5.0-dev

Pending
Medium
Build: 5.0-dev
# 41502
Hackwar:5.0-finder-utf8parsing


Hackwar
28 Aug 2023

Pull Request for Issue #40543.

Summary of Changes

Until now, the parsing and processing code used functions that are not UTF-8 safe, such as substr() and strrpos(). In some cases this broke the parsing and caused the behavior reported in #40543.
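
To illustrate the kind of failure described above, here is a minimal sketch in Python (not the Joomla indexer code; the sample string and helper names are made up for illustration). It shows how byte-oriented truncation can cut a multi-byte UTF-8 character in half, and how a byte offset differs from a character offset for the same space:

```python
# Illustration only: hypothetical helpers, not part of Joomla or com_finder.

def truncate_bytes(s: str, limit: int) -> bytes:
    """Byte-oriented truncation, comparable to a byte-based substr()."""
    return s.encode("utf-8")[:limit]


def truncate_chars(s: str, limit: int) -> str:
    """Character-oriented truncation, comparable to a UTF-8 aware substr()."""
    return s[:limit]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sample = "Grüße aus Köln"

# "ü" is encoded as two bytes (0xC3 0xBC), so cutting after 3 bytes splits it.
broken = truncate_bytes(sample, 3)
intact = truncate_chars(sample, 3)

# The last space sits at a different offset depending on the unit counted.
last_space_bytes = sample.encode("utf-8").rfind(b" ")
last_space_chars = sample.rfind(" ")

print(broken, is_valid_utf8(broken))
print(intact)
print(last_space_bytes, last_space_chars)
```

Mixing the two kinds of offsets, for example by using a byte position as a character index, shifts the cut point and can drop or misplace the space between words.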

Testing Instructions

Copy the text from #40543 into an article and replace the p-tags between the paragraphs with br-tags. Save the article. If you then look at the entry for that content item in #__finder_links, the description column contains the content, with some words pushed together because the space between them has been removed.

Actual result BEFORE applying this Pull Request

No space between some words.

Expected result AFTER applying this Pull Request

All words are separated by one space.

Link to documentations

Please select:

Documentation link for docs.joomla.org:
No documentation changes for docs.joomla.org needed
Pull Request link for manual.joomla.org:
No documentation changes for manual.joomla.org needed

aec9621 28 Aug 2023

Smart Search: Use UTF8-aware functions when indexing

joomla-cms-bot - change - 28 Aug 2023

Category

⇒

Administration com_finder

Hackwar - open - 28 Aug 2023

Hackwar - change - 28 Aug 2023

Status

New

⇒

Pending

d5375e7 30 Aug 2023

Merge branch '5.0-dev' into 5.0-finder-utf8parsing

Hackwar - change - 30 Aug 2023

Labels

Added: PR-5.0-dev

HLeithner - comment - 30 Aug 2023

We read 2048 bytes per fread() call. Couldn't this break UTF-8 encoded characters?

Hackwar - comment - 30 Aug 2023

If that is the case (I'm not saying that it is), we already have that problem and this PR wouldn't change that situation. But I don't think it is a problem, because the code looks for the last space in the block it has read and loads any missing bytes after that. More interesting, in my opinion, would be to raise the amount of data read at a time. Right now it's 2 KB, but I think we should raise that to at least 8 KB, more likely something around 20 KB...
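
The argument in this comment can be sketched as follows. This is a rough Python illustration under stated assumptions, not the actual indexer code: the function name and the carry-over logic are hypothetical. It relies on one property of UTF-8 that is certain: the byte 0x20 only ever encodes a space, because every byte of a multi-byte sequence has its high bit set. Cutting at the last space therefore never splits a character, and whatever follows that space is carried over to the next read.

```python
import io


def read_on_space_boundaries(stream, chunk_size=2048):
    """Yield decoded text pieces that always end on a space boundary.

    Hypothetical sketch: bytes after the last space of a block are kept
    and prepended to the next block, so a multi-byte character that was
    split by the fixed-size read is completed before decoding.
    """
    carry = b""
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        buffer = carry + block
        cut = buffer.rfind(b" ")
        if cut == -1:
            # No space yet: keep everything and read more.
            carry = buffer
            continue
        yield buffer[:cut + 1].decode("utf-8")
        carry = buffer[cut + 1:]
    if carry:
        # Remaining bytes after the final space (assumes valid UTF-8 input).
        yield carry.decode("utf-8")


text = "äöü " * 1000
pieces = list(read_on_space_boundaries(io.BytesIO(text.encode("utf-8")), 2048))
print(len(pieces), "".join(pieces) == text)
```

A larger chunk size, as proposed in the comment, would not change correctness in this sketch; it would only reduce the number of reads.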

HLeithner - comment - 31 Aug 2023

Maybe you are right; hitting a character code 20 (a space) inside a UTF-8 string at that position might be unlikely.
Increasing the length could help too. I don't know why the limit is only 2 KB; I would have expected at least 4 KB.

Hackwar - comment - 31 Aug 2023

I will create a separate PR to increase that parsing limit to a higher number. I would use 8 KB for the time being.

HLeithner - comment - 2 Sep 2023

thanks
