Joomla! Issue Tracker | Joomla! CMS #23647 - content search and searching Special Characters (HTML, UTF-8)

?

Closed
10 Feb 2019
Medium
Build: staging
# 23647

kofaysi
23 Jan 2019

content search and searching Special Characters (HTML, UNICODE)

Steps to reproduce the issue

Create two test articles (or reuse two existing articles)
Add a word in introtext or fulltext field for these two articles -- in the SQL background. Do not enter these words into the code using your code editor in the administration background -- most of them convert coding.

Add the word stáhnout into one of the articles
Add the word stáhnout into the other article
Optional: Choose a different word, which is not likely used in your contents, e.g. čeština and &ccaron;e&scaron;tina, maďarčina and ma&dcaron;ar&ccaron;ina

Search for both versions of the chosen words in the administrator site, e.g. content:stáhnout and content:stáhnout.

Update search/finder index, if necessary.

Expected result

Both articles containing the same word (in meaning) are found for both searches (HTML and UNICODE)

Actual result

Both searches displays only one article at the time.

System information (as much as possible)

Joomla 3.9.2

Additional comments

kofaysi - open - 23 Jan 2019

joomla-cms-bot - change - 23 Jan 2019

Labels

Added: ?

joomla-cms-bot - labeled - 23 Jan 2019

ReLater - comment - 23 Jan 2019

Why do you use entities? Is there any reason to use them?

kofaysi - comment - 23 Jan 2019

I have a bunch of articles at earlier times, around e.g. 2017-03-30 (Joomla always updated on the public release version) in which these entities occur. Some period later and new articles do not contain html entities, texts are unicode or utf-8 encoded.

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/23647.}

kofaysi - comment - 23 Jan 2019

I have a bunch of articles at earlier times, around e.g. 2017-03-30 (Joomla always updated on the public release version) in which these entities occur. Some period later and new articles do not contain html entities, texts are unicode or utf-8 encoded.

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/23647.}

kofaysi - comment - 23 Jan 2019

By opening and saving such article in the administrator site, the encoding changes, from html to unicode(?).

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/23647.}

kofaysi - comment - 23 Jan 2019

By opening and closing such article in the administrator site, the encoding changes, from html to unicode(?).

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/23647.}

kofaysi - comment - 23 Jan 2019

The creation date of these articles containing html entities is ranging from 2015-10-06 to 2017-04-25. I have no clue, if I have changed some Joomla configuration afterwards or the JCE Editor changed or Joomla core changed after this or later date.

kofaysi - comment - 24 Jan 2019

This is not the case for searching in frontend. A search with unicode characters returns also the phrase with html entities. A search with html entities is converted to unicode (in the search field) and returns also unicode entities.

brianteeman - comment - 24 Jan 2019

Search for both versions of the chosen words in the administrator site, e.g. content:stáhnout and content:stáhnout.

I assume you are referring to using the search above the list of content items? The ability to search inside content was only recently added. It does not use finder. It is just a basic mysql search.

infograf768 - comment - 24 Jan 2019

@kofaysi
Can you try this?
Modify line 341 of file /administrator/components/com_content/models/articles.php
from

$query->where('(a.introtext LIKE ' . $search . ' OR a.fulltext LIKE ' . $search . ')');

to

$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . htmlentities($search) . ' OR a.introtext LIKE ' . html_entity_decode($search) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . htmlentities($search) . ' OR a.fulltext LIKE ' . html_entity_decode($search) . ')');
Not sure it is the correct way to solve the issue but it looks like working here.

kofaysi - comment - 24 Jan 2019

I changed lines and copied the file back and tried the search. The behavior did not change. The search still finds only one of the two articles in question. I also logged out and logged in. No changes. Should I do anything more?

kofaysi - comment - 24 Jan 2019

The line suggested, n. 341, does influence the search, but using only
$query->where('(a.introtext LIKE ' . html_entity_decode($search) . ')');
gives exactly the same results as using the suggested complex new line by @infograf768. In this simple case, I would expect that unicode encoded words wouldn't yield any result, because they are going toe be html encoded. But they do find the correct unicode encoded article.

Performing these searches on the SQL server directly gives an interesting findings. Searching for "maďarčina" gives only a single results from the two articles, the unicode/utf8 one. But the query contains (among others)
CONVERT(introtext USING utf8) LIKE '%maďarčina%'
That is: the introtext should be converted from HTML entities to utf8 and only after this conversion searched or checked. The article-introtext containing "maďarčina" has not been found in this case.

infograf768 - comment - 25 Jan 2019

The code I posted takes into account:

Entering stáhnout in the search box and therefore finding articles who contain stáhnout AND stáhnout
Entering stáhnout in the search box and finding the articles who contain stáhnout AND stáhnout

Therefore it does work here.

kofaysi - comment - 25 Jan 2019

I'm reporting about searching for various words and their parts. Some searches were successful, some of them weren't. It seems like some of the characters are converted, some of them can't be converted, UTF8 <-> HTML.

stáhnout and st\áhnout

OK

maďarčina and ma&dcaron;ar&ccaron;ina

NOT OK
NOT OK searching for only part of the word

čeština and \&ccaron;e\&scaron;tina

NOT OK

přitáhnout and p&rcaron;itáhnout

NOT OK
OK searching for part of it
NOT OK searching for other part of it
NOT OK searching for other part of it

(other articles found, but no HTML entities article among the results)
NOT OK searching only for ASCII (it should not find anything, right?)

kofaysi - comment - 25 Jan 2019

I tried to modify the code using htmlentities($search, 'UTF-8') and `htmlentities($search, "UTF-8") with no luck: The code was not accepted by Joomla.

kofaysi - comment - 25 Jan 2019

But html_entity_decode() does something. This search is successful with the code by @infograf768 :

I modified the code to

$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . htmlentities($search, ENT_HTML5, "utf-8") . ' OR a.introtext LIKE ' . html_entity_decode($search, ENT_HTML5) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . htmlentities($search, ENT_HTML5, "utf-8") . ' OR a.fulltext LIKE ' . html_entity_decode($search, ENT_HTML5) . ')');

and using html_entity_decode($search, ENT_HTML5) works, see below for the whole Czech alphabet. I cannot get the parameters for htmlentities() right, though.

We found out that there is some problem converting č, ř, ď and other characters to HTML entities. So, now I'm going for all the Czech characters by using the Czech alphabet: aábcčdďeéěfghchiíjklmnňoópqrřsštťuúůvwxyýzžAÁBCČDĎEÉĚFGHChIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ
and
aábc&ccaron;d&dcaron;eé&ecaron;fghchiíjklmn&ncaron;oópqr&rcaron;s&scaron;t&tcaron;uú&uring;vwxyýz&zcaron;AÁBC&Ccaron;D&Dcaron;EÉ&Ecaron;FGHChIÍJKLMN&Ncaron;OÓPQR&Rcaron;S&Scaron;T&Tcaron;UÚ&Uring;VWXYÝZ&Zcaron;
search for the HTML variant content:aábc&ccaron;d&dcaron;eé&ecaron;fghchiíjklmn&ncaron;oópqr&rcaron;s&scaron;t&tcaron;uú&uring;vwxyýz&zcaron;AÁBC&Ccaron;D&Dcaron;EÉ&Ecaron;FGHChIÍJKLMN&Ncaron;OÓPQR&Rcaron;S&Scaron;T&Tcaron;UÚ&Uring;VWXYÝZ&Zcaron;
is successful
But I cannot get the conversion the other way right. Search for the UFT-8 variant content:aábcčdďeéěfghchiíjklmnňoópqrřsštťuúůvwxyýzžAÁBCČDĎEÉĚFGHChIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ

failed.

kofaysi - comment - 25 Jan 2019

This is really strange. Using htmlentities($search, ENT_HTML5) all special characters , e.g. á, č, ď, are converted to their ASCII variants: e.g. a, c, d. It was found out by searching for áčď and finding articles with acd.

So the closest I can get is using

$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . htmlentities($search) . ' OR a.introtext LIKE ' . html_entity_decode($search, ENT_HTML5) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . htmlentities($search) . ' OR a.fulltext LIKE ' . html_entity_decode($search, ENT_HTML5) . ')');

fully compliant with á, é, í, ó, ú characters, but not with other special characters.

infograf768 - comment - 26 Jan 2019

Yeah, it looks like nothing works for some htmlentities.
The only viable solution I see for you is to edit and save again the articles containing htmlentities in db.
Boring but necessary.
BBEdit would do that fast for you on a dump of the _content table, but beware as some htmlentities you are using may not be compatible. Example &ccaron; would not be decoded, but č or č will correctly to č.

kofaysi - change - 26 Jan 2019

Title

…

content search and searching Special Characters (HTML, U~~NICODE~~)

content search and searching Special Characters (HTML, UTF-8)

kofaysi - edited - 26 Jan 2019

kofaysi - comment - 26 Jan 2019

OK, I understand. Can you suggest any place, where I could report this unexpected behavior? (There is a PHP sandbox on the net, where this conversion works perfectly.)

kofaysi - comment - 27 Jan 2019

On line 340 two percent symbols % are added to the search string from left and from right. Those two symbols are converted to \&percnt; by htmlentities($search, ENT_HTML5). Example: The query \&percnt;foo\&percnt; is searched in the SQL DB Instead of %foo%, when searching content:foo.

I came up with this brute-force solution then:
$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . '\'%' . htmlentities(trim(stripslashes(substr($search, 1, -1)), '%'), ENT_HTML5) . '%\'' . ' OR a.introtext LIKE ' . html_entity_decode($search, ENT_HTML5) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . '\'%' . htmlentities(trim(stripslashes(substr($search, 1, -1)), '%'), ENT_HTML5) . '%\'' . ' OR a.fulltext LIKE ' . html_entity_decode($search, ENT_HTML5) . ')');

kofaysi - comment - 6 Feb 2019

Do you think it is worth a PR?
I do not understand how I could get half of the SQL DB in HTML and half of it UTF-8. But there might be users they are not aware of this at all, their search just does not work sometimes as expected. On the other hand, the suggested solution is only a part of the problem: What about non-content searches, i.e. searches within Titles, Authors, Notes? That query should be corrected, too, right?

HLeithner - comment - 6 Feb 2019

I think it makes no sense to try to convert the search query in every possible encoding supported on this planet. In my opinion its a problem with your data and maybe a used RCE Editor that converted a part of the content to a wrong encoding.

It's better you fix the your database instead of trying to push a workaround in to the core.

You write about tiles and authors but have thease fields the same problem?

kofaysi - comment - 10 Feb 2019

I understand. Thank you for fixing my search. I'll check my older articles.

kofaysi - change - 10 Feb 2019

Status	New	⇒	Closed
Closed_Date	0000-00-00 00:00:00	⇒	2019-02-10 19:45:53
Closed_By		⇒	kofaysi

kofaysi - close - 10 Feb 2019

Add a Comment

Login with GitHub to post a comment

Older
Newer