Joomla! Issue Tracker | Joomla! CMS #28493 - [4.0] Add OutputFilter tests


Fixed in Code Base
6 Apr 2020
Medium
Build: 4.0-dev
# 28493
Diff
Hackwar:j4testoutput


Hackwar
28 Mar 2020

This adds unit tests for the OutputFilter class. The code is half copied over from 3.x. Nothing special to look at. The Chinese symbol should mean "test" and be a 4-byte Unicode character.

This is related to #25845.

Hackwar - open - 28 Mar 2020

Hackwar - change - 28 Mar 2020

Status

New

⇒

Pending

joomla-cms-bot - change - 28 Mar 2020

Note

The PR does not respect the letter case of the test path.
We have
a/tests/Unit/libraries/cms/Filter/OutputFilterTest.php
which should be
a/tests/Unit/Libraries/Cms/Filter/OutputFilterTest.php

Before patch

Interesting results in 4.0 vs 3.x (see #25845).
Although the 4-byte character ? is not highlighted, we now get a result.

And, contrary to 3.x, the string 不能创建文件 is highlighted.

After patch

We do get a single highlighted 4-byte character, BUT it is not the correct one.
Looking for ?, we get ? highlighted.

infograf768 - comment - 31 Mar 2020

Hint: It also looks like a simple 不, which is not in the range supposed to have problems, is not found.
In 3.x it was found but not highlighted.

infograf768 - comment - 31 Mar 2020

I also tried using StringHelper for str_pad:
$code = StringHelper::str_pad(dechex(StringHelper::ord($chr)), 4, '0', STR_PAD_LEFT);

but there was no apparent change.
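
For reference, the line above builds the zero-padded hexadecimal code point of a character. A rough Python equivalent (an illustration only, not the Joomla code; `hex_code` is a hypothetical helper, and U+20000 is just an arbitrary character outside the Basic Multilingual Plane) shows how such characters end up with 5 hex digits and 4 UTF-8 bytes:

```python
def hex_code(ch: str) -> str:
    """Return the code point of one character as lowercase hex, left-padded to 4 digits."""
    return format(ord(ch), "x").rjust(4, "0")


bmp_char = "不"                   # U+4E0D, inside the Basic Multilingual Plane
supplementary_char = chr(0x20000)  # arbitrary example outside the BMP (assumption)

for ch in (bmp_char, supplementary_char):
    # Show the hex code next to the number of bytes the character needs in UTF-8.
    print(hex_code(ch), len(ch.encode("utf-8")))
```

The padding only matters for code points below U+1000; anything above U+FFFF already produces 5 or more hex digits, so padding to 4 leaves it unchanged.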

002fee6 4 Apr 2020

Adding tests for OutputFilter

Hackwar - change - 4 Apr 2020

Labels

Added: ?
Removed: ?

Hackwar - comment - 4 Apr 2020

Thank you @infograf768 for your tests here. You are right, the case of the folder names was wrong. Curse you, Windows! I fixed that. The OutputFilter class and the tests in here are correct, so this PR would be fine to merge, but there is indeed at least one serious bug in Smart Search regarding 5 byte characters. I will try to document what I could find out so far:

Indexing messes up 5 byte characters. Create a test article on an English site with "? ?". This will result in the latter character being added to the index, but not the first one. The characters are properly added to the #__finder_tokens table, but then they are not properly moved over to #__finder_tokens_aggregate. It fails when doing a "SELECT DISTINCT term FROM #__finder_tokens" in the indexer in /administrator/components/com_finder/src/Indexer/Driver/Mysql.php, line 262. No idea what to do here...
The search term is somehow messed up as well. When searching for ?, it matches on ?. It states that ? is required in the output, but then again asks the highlighter to highlight ?.

I will open new issues for this, but would like to ask that this PR be merged. The error described by @infograf768 is not in this part of the code.

Hackwar - comment - 4 Apr 2020

Sorry, I messed up a bit above. It's not 5-byte chars but 5 hex chars, thus 4-byte Unicode characters. Either way, MySQL simply takes those 2 characters, compares them and thinks they are identical.

richard67 - comment - 4 Apr 2020

It seems there really is a problem with MySQL and these 2 Chinese characters ? and ? . Using phpMyAdmin, I inserted some records with these characters into a table in a varchar column with utf8mb4_unicode_ci collation on MySQL 8.0.19, did a "SELECT DISTINCT ...", and only one of them was returned.
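
One way to picture this: as far as I know, the MySQL manual states that in the UCA 4.0.0 based collations such as utf8mb4_unicode_ci all supplementary (4-byte) characters share a single collating weight, so DISTINCT sees them as duplicates. The sketch below only imitates that behaviour with SQLite and a hypothetical `fake_uca400` collation. It is not MySQL, the `tokens` table is made up, and the two code points are placeholders, not the characters from this thread:

```python
import sqlite3


def weight(ch: str) -> int:
    # Assumption for illustration: every character above U+FFFF gets the same weight.
    cp = ord(ch)
    return 0xFFFD if cp > 0xFFFF else cp


def fake_uca400(a: str, b: str) -> int:
    # Collation callback: compare strings by their per-character weights.
    ka = [weight(c) for c in a]
    kb = [weight(c) for c in b]
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("fake_uca400", fake_uca400)
conn.execute("CREATE TABLE tokens (term TEXT)")

# Two placeholder supplementary characters plus one BMP character.
terms = [chr(0x20000), chr(0x20001), "不"]
conn.executemany("INSERT INTO tokens VALUES (?)", [(t,) for t in terms])

# Default (binary) comparison keeps every distinct code point sequence.
binary_rows = conn.execute("SELECT DISTINCT term FROM tokens").fetchall()

# With the weight-collapsing collation the two supplementary characters merge.
folded_rows = conn.execute(
    "SELECT DISTINCT term COLLATE fake_uca400 FROM tokens"
).fetchall()

print(len(binary_rows), len(folded_rows))
```

Here the binary comparison keeps all three rows, while the collated query keeps only two, which mirrors the lost token described above.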

richard67 - comment - 4 Apr 2020

If the table has collation utf8mb4_0900_ai_ci, it works, but as far as I can see this collation is available on MySQL 8 and not on e.g. 5.7.

richard67 - comment - 4 Apr 2020

For possible explanations, see https://mysqlserverteam.com/mysql-8-0-collations-the-devil-is-in-the-details/.

richard67 - comment - 4 Apr 2020

@alikon If you check my 3 previous comments I am 100 % sure about what you will say: With PostgreSQL we don't have that problem ;-)

richard67 - comment - 5 Apr 2020

To me the changes in this PR seem to be correct.

The problems we have, e.g. with certain Chinese characters, seem to be a collation problem in MySQL (and MariaDB).

Now one could think "let's use utf8mb4_bin collation everywhere", but that might not be correct for a particular language. Only the language-specific collation is really correct for that language, both regarding the equivalence of certain UTF-8 characters or two-character sequences (as in German with ß and ss) and regarding sorting, so at least on multilingual sites you always have to make some compromise.
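
To make the trade-off concrete, here is a small Python analogy. It is only a sketch: `distinct` is a hypothetical helper, and Python's `str.casefold` stands in for a case- and accent-aware collation, which is not the same as any MySQL collation. A binary comparison keeps every spelling apart, while a folding comparison treats ß and ss as equal:

```python
def distinct(terms, key):
    """Keep the first term for each comparison key, like SELECT DISTINCT under a collation."""
    seen = {}
    for term in terms:
        seen.setdefault(key(term), term)
    return list(seen.values())


terms = ["Straße", "Strasse", "STRASSE"]

# Binary comparison: every byte sequence is its own value.
binary = distinct(terms, key=lambda t: t.encode("utf-8"))

# Folding comparison: case is ignored and ß folds to ss.
folded = distinct(terms, key=str.casefold)

print(binary)
print(folded)
```

The binary variant never merges two different characters, which would avoid the lost tokens, but it would also stop matching spellings that a German reader considers the same word.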

@infograf768 Any thoughts?

richard67 - comment - 5 Apr 2020

@Hackwar @chrisdavenport Maybe it could make sense to use utf8mb4_bin collation for the finder_tokens and the finder_tokens_aggregate tables, or at least for particular columns of these tables, e.g. term? See my previous comment about the difference between some collations.

infograf768 - comment - 5 Apr 2020

This is exactly why I have always said that we should keep com_search... It is a viable alternative for some languages.

infograf768 - comment - 6 Apr 2020

See #25845 (comment) concerning com_search

alikon - comment - 6 Apr 2020

@richard67 yes, PostgreSQL works just fine as usual

the test table

CREATE TABLE IF NOT EXISTS "chtest" (
  "id" serial NOT NULL,
  "col1" varchar(50) NOT NULL,
  PRIMARY KEY ("id")
);

the test data

insert into "chtest" values (1,'?');
insert into "chtest" values (2,'?');
insert into "chtest" values (3,'Ä');
insert into "chtest" values (4,'ä');
insert into "chtest" values (5,'Ë');
insert into "chtest" values (6,'ë');
insert into "chtest" values (7,'Ï');
insert into "chtest" values (8,'ï');
insert into "chtest" values (9,'Ö');
insert into "chtest" values (10,'ö');
insert into "chtest" values (11,'Ü');
insert into "chtest" values (12,'ü');
insert into "chtest" values (13,'Ÿ');
insert into "chtest" values (14,'ÿ');
insert into "chtest" values (15,'å');
insert into "chtest" values (16,'æé');
insert into "chtest" values (17,'ø');

the test query

select distinct col1 from chtest;

the results

rdeutz - change - 6 Apr 2020

Status	Pending	⇒	Fixed in Code Base
Closed_Date	0000-00-00 00:00:00	⇒	2020-04-06 08:12:53
Closed_By		⇒	rdeutz
Labels	Added: ? Removed: ?

rdeutz - close - 6 Apr 2020

rdeutz - merge - 6 Apr 2020

infograf768 - comment - 6 Apr 2020

Please test #28587
