This adds unit tests for the OutputFilter class. The code is half copied over from 3.x. Nothing special to look at. The Chinese symbol should mean "test" and be a 4-byte Unicode character.
This is related to #25845.
Status | New | ⇒ | Pending |
Category | ⇒ | Unit Tests |
Category | Unit Tests | ⇒ | Libraries Unit Tests |
Hint: it also looks like a simple 不, which is not in the range that is supposed to have problems, is not found.
In 3.x it was found but not highlighted.
I tried using StringHelper for str_pad as well:
$code = StringHelper::str_pad(dechex(StringHelper::ord($chr)), 4, '0', STR_PAD_LEFT);
but there was no apparent change.
Thank you @infograf768 for your tests here. You are right, the case of the folder names was wrong. Curse you, Windows! I fixed that. The OutputFilter class and the tests in here are correct, so this PR would be fine to merge, but there is indeed at least one serious bug in Smart Search regarding 5 byte characters. I will try to document what I could find out so far:
Indexing messes up 5 byte characters. Create a test article on an English site with "? ?". This will result in the latter character being added to the index, but not the first one. The characters are properly added to the #__finder_tokens table, but they are not properly moved over to #__finder_tokens_aggregate. It fails when doing a "SELECT DISTINCT term FROM #__finder_tokens" in the indexer in /administrator/components/com_finder/src/Indexer/Driver/Mysql.php line 262. No idea what to do here...
The search term is somehow messed up as well. When searching for ?, it matches on ?. It states that ? is required in the output, but then again asks the highlighter to highlight ?.
I will open new issues for this, but would like to ask that this PR be merged. The error described by @infograf768 is not in this part of the code.
Sorry, I messed up a bit above. It's not 5 byte chars, but chars whose code point has 5 hex digits, which makes them 4-byte characters in UTF-8. Either way, MySQL simply takes those 2 characters, compares them and thinks they are identical.
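To make the difference between hex digits and bytes concrete, here is a minimal Python sketch. The `describe` helper is hypothetical, and U+20000 is only an illustrative supplementary code point, not necessarily one of the characters tested in this thread. The hex formatting mirrors what the `str_pad(dechex(...), 4, '0', STR_PAD_LEFT)` call quoted earlier computes.

```python
def describe(ch):
    """Return the zero-padded hex code point and the UTF-8 length of one character."""
    code_point = ord(ch)
    return {
        # Same idea as str_pad(dechex(ord($chr)), 4, '0', STR_PAD_LEFT) in PHP
        "hex": format(code_point, "04x"),
        "utf8_bytes": len(ch.encode("utf-8")),
    }


# A BMP character (the simple 不 mentioned earlier in this thread)
simple = describe("\u4e0d")
# An illustrative supplementary character (assumption, see lead-in)
supplementary = describe("\U00020000")

print(simple)
print(supplementary)
```

A code point with 4 hex digits fits into 3 UTF-8 bytes at most, while one with 5 hex digits always needs 4.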
It seems there really is a problem with MySQL and these 2 Chinese characters ? and ? . Using phpMyAdmin, I inserted some records with these characters into a varchar column with utf8mb4_unicode_ci collation on MySQL 8.0.19, ran a "SELECT DISTINCT ..." and got only one of them back.
If the table has collation utf8mb4_0900_ai_ci it works, but as far as I can see this collation is available on MySQL 8 but not on e.g. 5.7.
For possible explanations, see https://mysqlserverteam.com/mysql-8-0-collations-the-devil-is-in-the-details/.
To me the changes in this PR seem to be correct.
The problems we have e.g. with certain Chinese characters seem to be a collation problem in MySQL (and MariaDB).
Now one could think "let's use utf8mb4_bin collation everywhere", but that might not be correct for a particular language. Only the language-specific collation would be really correct for that language, both regarding the equivalence of certain UTF-8 characters or 2-character sequences (as in German with ß and ss) and regarding sorting, so at least on multilingual sites you always have to make some compromise.
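The effect of a collation on "SELECT DISTINCT" can be sketched without a MySQL server. The following is only an analogy using Python's sqlite3 module: `supplementary_blind` is a hypothetical toy collation that gives every code point above U+FFFF the same weight (my assumption about how the older utf8mb4 collations behave, not a copy of MySQL's rules), the `tokens` table is made up, and the two supplementary characters are illustrative.

```python
import sqlite3


def supplementary_blind(a, b):
    """Toy collation: all code points above U+FFFF compare as equal."""
    key_a = [min(ord(ch), 0x10000) for ch in a]
    key_b = [min(ord(ch), 0x10000) for ch in b]
    return (key_a > key_b) - (key_a < key_b)


def distinct_terms(collation):
    """Run SELECT DISTINCT over three terms using the given collation name."""
    con = sqlite3.connect(":memory:")
    con.create_collation("SUPP_BLIND", supplementary_blind)
    con.execute("CREATE TABLE tokens (term TEXT)")
    con.executemany(
        "INSERT INTO tokens VALUES (?)",
        [("\U00020000",), ("\U00020001",), ("\u4e0d",)],
    )
    rows = con.execute(
        f"SELECT DISTINCT term COLLATE {collation} FROM tokens"
    ).fetchall()
    con.close()
    return [row[0] for row in rows]


print(len(distinct_terms("BINARY")))      # byte-wise comparison
print(len(distinct_terms("SUPP_BLIND")))  # toy collation
```

With the byte-wise comparison all three terms survive, while the toy collation collapses the two supplementary characters into one row. That is the trade-off described above: a binary collation keeps every term distinct, but it also gives up language-aware equivalences such as ß and ss.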
@infograf768 Any thoughts?
@Hackwar @chrisdavenport Maybe it could make sense to use utf8mb4_bin collation for the finder_tokens and the finder_tokens_aggregate tables, or at least for particular columns of these tables, e.g. term? See my previous comment about the difference between some collations.
This is exactly why I have always said that we should keep com_search... It is a viable alternative for some languages.
See #25845 (comment) concerning com_search
@richard67 yes postgresql works just fine as usual
the test table
CREATE TABLE IF NOT EXISTS "chtest" (
"id" serial NOT NULL,
"col1" varchar(50) NOT NULL,
PRIMARY KEY ("id")
);
the test data
insert into "chtest" values (1,'?');
insert into "chtest" values (2,'?');
insert into "chtest" values (3,'Ä');
insert into "chtest" values (4,'ä');
insert into "chtest" values (5,'Ë');
insert into "chtest" values (6,'ë');
insert into "chtest" values (7,'Ï');
insert into "chtest" values (8,'ï');
insert into "chtest" values (9,'Ö');
insert into "chtest" values (10,'ö');
insert into "chtest" values (11,'Ü');
insert into "chtest" values (12,'ü');
insert into "chtest" values (13,'Ÿ');
insert into "chtest" values (14,'ÿ');
insert into "chtest" values (15,'å');
insert into "chtest" values (16,'æé');
insert into "chtest" values (17,'ø');
the test query
select distinct col1 from chtest;
Status | Pending | ⇒ | Fixed in Code Base |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2020-04-06 08:12:53 |
Closed_By | ⇒ | rdeutz | |
Note
The PR does not respect letter case in the path of the test file.
We have
a/tests/Unit/libraries/cms/Filter/OutputFilterTest.php
which should be
a/tests/Unit/Libraries/Cms/Filter/OutputFilterTest.php
Before patch
Interesting results in 4.0 vs 3.x (See #25845 )

Although the 4-byte character ? is not highlighted, we now get a result. And, contrary to 3.x, the string 不能创建文件 is highlighted.

After patch
We do get a highlighted single 4-byte character, BUT it is not the correct one.

Looking for ?, we get ? highlighted.