? Pending

User tests: Successful: Unsuccessful:

avatar Hackwar
Hackwar
14 Aug 2019

The previous code doesn't really support non-english characters and for example fails on german umlauts. This code is used in com_finder, so when you are searching for something like "Europäer", you wont get any highlighting for the searched term in the search results.

How to test

  1. Enable Smart Search content plugin
  2. Edit/create an article with the word "Europäer"
  3. Search for that word in the frontend
  4. See that the word "Europäer" is not highlighted in the search results
  5. Apply patch
  6. Search again and now see it being highlighted.
avatar Hackwar Hackwar - open - 14 Aug 2019
avatar Hackwar Hackwar - change - 14 Aug 2019
Status New Pending
avatar joomla-cms-bot joomla-cms-bot - change - 14 Aug 2019
Category Libraries
avatar brianteeman
brianteeman - comment - 14 Aug 2019

This is something we really should have a test for. Specifically to ensure it not only supports accented characters in latin characters sets but also in non latin character sets and multibyte character sets.

avatar infograf768 infograf768 - test_item - 15 Aug 2019 - Tested successfully
avatar infograf768
infograf768 - comment - 15 Aug 2019

Although it does not always work (see below), it is a great improvement.
I have tested this item successfully on 3ea6a47


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/25845.

avatar infograf768
infograf768 - comment - 15 Aug 2019

I have tested this item successfully on 3ea6a47


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/25845.

avatar infograf768
infograf768 - comment - 15 Aug 2019

Example with Chinese

Screen Shot 2019-08-15 at 08 17 52

avatar infograf768
infograf768 - comment - 15 Aug 2019

Note: It does not work for single 4 bytes characters.
Example: ?

Edit:
"They are in the Supplementary Plane and will test support for code points above U+FFFF"
See: http://www.i18nguy.com/unicode/supplementary-test.html

avatar infograf768
infograf768 - comment - 15 Aug 2019

Dumping some of these 4 bytes characters is correct though

 Assuming ? is required, the following results were found.
string(7) "\u20e9d" 
avatar wilsonge
wilsonge - comment - 15 Aug 2019

Agree with @brianteeman this is an easy function to test as it's static and no dependencies.

avatar richard67 richard67 - test_item - 15 Aug 2019 - Tested successfully
avatar richard67
richard67 - comment - 15 Aug 2019

I have tested this item successfully on 3ea6a47

Tested with "Überlingen", "Europäer", "Музыка", "Curaçao" and "香肠"


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/25845.

avatar richard67
richard67 - comment - 15 Aug 2019

@infograf768 Just I noticed your posts above. Does it mean your test was not good? Should that work, 4-byte UTF-8? I know we did much for MySQL to support utf8mb4, and PHP should be ok, but maybe we have forgotten something in js?


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/25845.

avatar richard67 richard67 - test_item - 15 Aug 2019 - Not tested
avatar richard67
richard67 - comment - 15 Aug 2019

I have not tested this item.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/25845.

avatar richard67 richard67 - test_item - 15 Aug 2019 - Tested successfully
avatar richard67
richard67 - comment - 15 Aug 2019

I have tested this item successfully on 3ea6a47

Agree with JM: A big improvement compared to now.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/25845.

avatar Fedik
Fedik - comment - 15 Aug 2019

hm, can just be?

return json_encode($string);
avatar infograf768
infograf768 - comment - 16 Aug 2019

@Fedik
with json_encode($new_str) we lose highlighting.

@richard67
In my posts above I was just wondering why it does not work fully with 4 bytes glyphs when isolated.

FYI, this code is already present in 4.0 and we get the same results as here.

avatar richard67
richard67 - comment - 23 Aug 2019

@Hackwar Is this code also used by the media manager? If so: Could it be that it solves issue #25997 ?
Edit: No, it seems to be used by the highlight behavior. I don't think that is used by media manager. But it seems media manager has a problem similar to the one handled in this PR here.

avatar brianteeman
brianteeman - comment - 23 Aug 2019

Richard. The media issue is completely different and a large part of that is the filesystem

avatar Hackwar Hackwar - change - 21 Oct 2019
Labels Added: ?
avatar Hackwar
Hackwar - comment - 21 Oct 2019

I'll put writing tests for this on my todo list, but can we still merge this for 4.0, @wilsonge?

avatar Quy Quy - change - 27 Nov 2019
Status Pending Ready to Commit
avatar Quy
Quy - comment - 27 Nov 2019

RTC


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/25845.

avatar Quy Quy - change - 27 Nov 2019
Labels Added: ?
avatar HLeithner
HLeithner - comment - 5 Dec 2019

@infograf768 does this support mb4?

@Hackwar you wrote this pr against J3, do you had time to write some tests for this?

avatar infograf768
infograf768 - comment - 6 Dec 2019

@infograf768 does this support mb4?

yes, when we do not search an isolated utf8mb4 glyph
see above : #25845 (comment)

avatar HLeithner
HLeithner - comment - 6 Dec 2019

ok so can we fix this case for single mb4 character and have a test for this @Hackwar ?

Then we can fix this issue in staging. Thanks

avatar Hackwar
Hackwar - comment - 7 Dec 2019

Sorry, I would really love to help you right now, but I'm really swamped with work in the next few weeks. I doubt that I can get to this before christmas. If anybody wants to make a PR against my branch, that would be appreciated.

avatar HLeithner
HLeithner - comment - 11 Dec 2019

Since this is out JS safe output filter function I would request some feedback from @joomla/security and a complete PR with the mentioned single mb4 character.

avatar infograf768 infograf768 - change - 12 Dec 2019
Labels Added: ?
avatar Hackwar Hackwar - change - 28 Mar 2020
Labels Added: ?
Removed: ? ?
avatar joomla-cms-bot joomla-cms-bot - change - 28 Mar 2020
Category Libraries Libraries Unit Tests
avatar Hackwar Hackwar - change - 28 Mar 2020
Labels Added: ? ? ?
Removed: ?
avatar Hackwar
Hackwar - comment - 28 Mar 2020

I added a unit test. @infograf768 can you have a look if I did this right with the utf8mb4 glyph?

avatar infograf768
infograf768 - comment - 28 Mar 2020

Will test tomorrow

avatar infograf768
infograf768 - comment - 29 Mar 2020

Still no result for unique glyph like
?

And we now lose highlighting for
不能创建文件

Screen Shot 2020-03-29 at 09 35 30

i.e. it is a regression vs your original PR

avatar Hackwar Hackwar - change - 29 Mar 2020
Labels Added: ?
Removed: ? ?
avatar Hackwar
Hackwar - comment - 29 Mar 2020

I tested with your strings and the second string did work, while the first indeed did not work. I looked into this and again noticed that I'm not a JS developer. Javascript uses UCS-2 to encode its strings and that means that the code we've been using will only work for Unicode characters up to U+FFFF. Everything above would have to be converted in a way that I found too complicated for me to grasp. I took the cheap way out and for such characters I'm now using ECMAScript 6 notation with curly braces around the escape sequence. I did not want to use that for all characters, since I fear that that would be a break in B/C, since it would require that every browser would have to support ECMAScript. Since this feature didn't work at all up till now, I think it would be okay to require that compatibility for higher codepoints... What do you think?

avatar Hackwar Hackwar - change - 29 Mar 2020
Labels Added: ?
Removed: ?
avatar Quy Quy - change - 29 Mar 2020
Status Ready to Commit Pending
avatar infograf768
infograf768 - comment - 30 Mar 2020

不能创建文件 is still not highlighted here (same result as above #25845 (comment) ) while it is when we use com_search

com_search
Screen Shot 2020-03-30 at 08 46 23

avatar infograf768
infograf768 - comment - 31 Mar 2020

See also my tests for j4 #28493

avatar wilsonge
wilsonge - comment - 31 Mar 2020

@Hackwar as long as you can do some feature detection i think it should be ok to use progressive enhancement

avatar richard67
richard67 - comment - 5 Apr 2020

@infograf768 The Chinese problem could come from a collation problem in MySQL, see my comments in the J4 PR: #28493 (comment) and previous.

avatar infograf768
infograf768 - comment - 5 Apr 2020

@richard67
read. We should keep com_search

avatar richard67
richard67 - comment - 5 Apr 2020

@infograf768 I'd like to keep com_search too, at least as alternative for small sites like my private site.

avatar Hackwar
Hackwar - comment - 5 Apr 2020

However this has nothing to do with com_search or com_finder. You have the exact same issue in com_search as well. It is an issue with the DB and the bad support for chinese characters.

avatar Hackwar
Hackwar - comment - 5 Apr 2020

@wilsonge @HLeithner Are you saying that you are requiring some check if this new notation is allowed to be used? Because I don't see how that could be done... Considering that it is broken right now and this would fix it for all users except for <IE11, I would accept this change as is...

avatar infograf768
infograf768 - comment - 6 Apr 2020

@Hackwar

You have the exact same issue in com_search as well.

Wrong.
Once you set getLowerLimitSearchWord()to 1 instead of 3, the correct result is obtained and highlighted.

Screen Shot 2020-04-06 at 08 38 11

avatar infograf768
infograf768 - comment - 6 Apr 2020

OK to merge. Won't solve the issue in J3 as it would require a lot of work to make a new conversion. Also, as com_search works fine, there is an alternate solution.

For J4, see #28587

avatar HLeithner HLeithner - change - 8 Apr 2020
Status Pending Fixed in Code Base
Closed_Date 0000-00-00 00:00:00 2020-04-08 10:08:36
Closed_By HLeithner
Labels Removed: ? ?
avatar HLeithner HLeithner - close - 8 Apr 2020
avatar HLeithner HLeithner - merge - 8 Apr 2020
avatar HLeithner
HLeithner - comment - 8 Apr 2020

Thanks

Add a Comment

Login with GitHub to post a comment