?
avatar kofaysi
kofaysi
23 Jan 2019

content search and searching Special Characters (HTML, UNICODE)

Steps to reproduce the issue

  1. Create two test articles (or reuse two existing articles)
  2. Add a word in introtext or fulltext field for these two articles -- in the SQL background. Do not enter these words into the code using your code editor in the administration background -- most of them convert coding.
  • Add the word stáhnout into one of the articles
  • Add the word stáhnout into the other article
  • Optional: Choose a different word, which is not likely used in your contents, e.g. čeština and čeština, maďarčina and maďarčina
  1. Search for both versions of the chosen words in the administrator site, e.g. content:stáhnout and content:stáhnout.

Update search/finder index, if necessary.

Expected result

Both articles containing the same word (in meaning) are found for both searches (HTML and UNICODE)

Actual result

Both searches displays only one article at the time.

System information (as much as possible)

Joomla 3.9.2

Additional comments

avatar kofaysi kofaysi - open - 23 Jan 2019
avatar joomla-cms-bot joomla-cms-bot - change - 23 Jan 2019
Labels Added: ?
avatar joomla-cms-bot joomla-cms-bot - labeled - 23 Jan 2019
avatar ReLater
ReLater - comment - 23 Jan 2019

Why do you use entities? Is there any reason to use them?

avatar kofaysi
kofaysi - comment - 23 Jan 2019

I have a bunch of articles at earlier times, around e.g. 2017-03-30 (Joomla always updated on the public release version) in which these entities occur. Some period later and new articles do not contain html entities, texts are unicode or utf-8 encoded.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/23647.

avatar kofaysi
kofaysi - comment - 23 Jan 2019

I have a bunch of articles at earlier times, around e.g. 2017-03-30 (Joomla always updated on the public release version) in which these entities occur. Some period later and new articles do not contain html entities, texts are unicode or utf-8 encoded.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/23647.

avatar kofaysi
kofaysi - comment - 23 Jan 2019

By opening and saving such article in the administrator site, the encoding changes, from html to unicode(?).


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/23647.

avatar kofaysi
kofaysi - comment - 23 Jan 2019

By opening and closing such article in the administrator site, the encoding changes, from html to unicode(?).


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/23647.

avatar kofaysi
kofaysi - comment - 23 Jan 2019

The creation date of these articles containing html entities is ranging from 2015-10-06 to 2017-04-25. I have no clue, if I have changed some Joomla configuration afterwards or the JCE Editor changed or Joomla core changed after this or later date.

avatar kofaysi
kofaysi - comment - 24 Jan 2019

This is not the case for searching in frontend. A search with unicode characters returns also the phrase with html entities. A search with html entities is converted to unicode (in the search field) and returns also unicode entities.

avatar brianteeman
brianteeman - comment - 24 Jan 2019

Search for both versions of the chosen words in the administrator site, e.g. content:stáhnout and content:stáhnout.

I assume you are referring to using the search above the list of content items? The ability to search inside content was only recently added. It does not use finder. It is just a basic mysql search.

avatar infograf768
infograf768 - comment - 24 Jan 2019

@kofaysi
Can you try this?
Modify line 341 of file /administrator/components/com_content/models/articles.php
from

$query->where('(a.introtext LIKE ' . $search . ' OR a.fulltext LIKE ' . $search . ')');

to

$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . htmlentities($search) . ' OR a.introtext LIKE ' . html_entity_decode($search) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . htmlentities($search) . ' OR a.fulltext LIKE ' . html_entity_decode($search) . ')');
Not sure it is the correct way to solve the issue but it looks like working here.

avatar kofaysi
kofaysi - comment - 24 Jan 2019

I changed lines and copied the file back and tried the search. The behavior did not change. The search still finds only one of the two articles in question. I also logged out and logged in. No changes. Should I do anything more?

avatar kofaysi
kofaysi - comment - 24 Jan 2019

The line suggested, n. 341, does influence the search, but using only
$query->where('(a.introtext LIKE ' . html_entity_decode($search) . ')');
gives exactly the same results as using the suggested complex new line by @infograf768. In this simple case, I would expect that unicode encoded words wouldn't yield any result, because they are going toe be html encoded. But they do find the correct unicode encoded article.

Performing these searches on the SQL server directly gives an interesting findings. Searching for "maďarčina" gives only a single results from the two articles, the unicode/utf8 one. But the query contains (among others)
CONVERT(introtext USING utf8) LIKE '%maďarčina%'
That is: the introtext should be converted from HTML entities to utf8 and only after this conversion searched or checked. The article-introtext containing "maďarčina" has not been found in this case.

avatar infograf768
infograf768 - comment - 25 Jan 2019

The code I posted takes into account:

  1. Entering stáhnout in the search box and therefore finding articles who contain stáhnout AND stáhnout
  2. Entering stáhnout in the search box and finding the articles who contain stáhnout AND stáhnout

screen shot 2019-01-25 at 07 57 53

screen shot 2019-01-25 at 07 59 25

screen shot 2019-01-25 at 08 00 25

Therefore it does work here.

avatar kofaysi
kofaysi - comment - 25 Jan 2019

I'm reporting about searching for various words and their parts. Some searches were successful, some of them weren't. It seems like some of the characters are converted, some of them can't be converted, UTF8 <-> HTML.

  1. stáhnout and st\&aacute;hnout
    image
  • OK
    image
    image
  1. maďarčina and ma&dcaron;ar&ccaron;ina
    image
  • NOT OK
    image
    image
  • NOT OK searching for only part of the word
    image
    image
  1. čeština and \&ccaron;e\&scaron;tina
    image
  • NOT OK
    image
    image
  1. přitáhnout and p&rcaron;it&aacute;hnout
    image
  • NOT OK
    image
    image
  • OK searching for part of it
    image
    image
  • NOT OK searching for other part of it
    image
  • NOT OK searching for other part of it
    image
    image (other articles found, but no HTML entities article among the results)
  • NOT OK searching only for ASCII (it should not find anything, right?)
    image
avatar kofaysi
kofaysi - comment - 25 Jan 2019

I tried to modify the code using htmlentities($search, 'UTF-8') and `htmlentities($search, "UTF-8") with no luck: The code was not accepted by Joomla.

avatar kofaysi
kofaysi - comment - 25 Jan 2019

But html_entity_decode() does something. This search is successful with the code by @infograf768 :
image
I modified the code to

$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . htmlentities($search, ENT_HTML5, "utf-8") . ' OR a.introtext LIKE ' . html_entity_decode($search, ENT_HTML5) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . htmlentities($search, ENT_HTML5, "utf-8") . ' OR a.fulltext LIKE ' . html_entity_decode($search, ENT_HTML5) . ')');

and using html_entity_decode($search, ENT_HTML5) works, see below for the whole Czech alphabet. I cannot get the parameters for htmlentities() right, though.

  • We found out that there is some problem converting č, ř, ď and other characters to HTML entities. So, now I'm going for all the Czech characters by using the Czech alphabet: aábcčdďeéěfghchiíjklmnňoópqrřsštťuúůvwxyýzžAÁBCČDĎEÉĚFGHChIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ
    and
    a&aacute;bc&ccaron;d&dcaron;e&eacute;&ecaron;fghchi&iacute;jklmn&ncaron;o&oacute;pqr&rcaron;s&scaron;t&tcaron;u&uacute;&uring;vwxy&yacute;z&zcaron;A&Aacute;BC&Ccaron;D&Dcaron;E&Eacute;&Ecaron;FGHChI&Iacute;JKLMN&Ncaron;O&Oacute;PQR&Rcaron;S&Scaron;T&Tcaron;U&Uacute;&Uring;VWXY&Yacute;Z&Zcaron;
    image
  • search for the HTML variant content:a&aacute;bc&ccaron;d&dcaron;e&eacute;&ecaron;fghchi&iacute;jklmn&ncaron;o&oacute;pqr&rcaron;s&scaron;t&tcaron;u&uacute;&uring;vwxy&yacute;z&zcaron;A&Aacute;BC&Ccaron;D&Dcaron;E&Eacute;&Ecaron;FGHChI&Iacute;JKLMN&Ncaron;O&Oacute;PQR&Rcaron;S&Scaron;T&Tcaron;U&Uacute;&Uring;VWXY&Yacute;Z&Zcaron;
    is successful
    image
  • But I cannot get the conversion the other way right. Search for the UFT-8 variant content:aábcčdďeéěfghchiíjklmnňoópqrřsštťuúůvwxyýzžAÁBCČDĎEÉĚFGHChIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ
    image
    failed.
avatar kofaysi
kofaysi - comment - 25 Jan 2019

This is really strange. Using htmlentities($search, ENT_HTML5) all special characters , e.g. á, č, ď, are converted to their ASCII variants: e.g. a, c, d. It was found out by searching for áčď and finding articles with acd.

So the closest I can get is using

$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . htmlentities($search) . ' OR a.introtext LIKE ' . html_entity_decode($search, ENT_HTML5) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . htmlentities($search) . ' OR a.fulltext LIKE ' . html_entity_decode($search, ENT_HTML5) . ')');

fully compliant with á, é, í, ó, ú characters, but not with other special characters.

avatar infograf768
infograf768 - comment - 26 Jan 2019

Yeah, it looks like nothing works for some htmlentities.
The only viable solution I see for you is to edit and save again the articles containing htmlentities in db.
Boring but necessary.
BBEdit would do that fast for you on a dump of the _content table, but beware as some htmlentities you are using may not be compatible. Example &ccaron; would not be decoded, but &#x10D; or &#269; will correctly to č.

avatar kofaysi kofaysi - change - 26 Jan 2019
Title
content search and searching Special Characters (HTML, UNICODE)
content search and searching Special Characters (HTML, UTF-8)
avatar kofaysi kofaysi - edited - 26 Jan 2019
avatar kofaysi
kofaysi - comment - 26 Jan 2019

OK, I understand. Can you suggest any place, where I could report this unexpected behavior? (There is a PHP sandbox on the net, where this conversion works perfectly.)

avatar kofaysi
kofaysi - comment - 27 Jan 2019

On line 340 two percent symbols % are added to the search string from left and from right. Those two symbols are converted to \&percnt; by htmlentities($search, ENT_HTML5). Example: The query \&percnt;foo\&percnt; is searched in the SQL DB Instead of %foo%, when searching content:foo.

I came up with this brute-force solution then:
$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . '\'%' . htmlentities(trim(stripslashes(substr($search, 1, -1)), '%'), ENT_HTML5) . '%\'' . ' OR a.introtext LIKE ' . html_entity_decode($search, ENT_HTML5) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . '\'%' . htmlentities(trim(stripslashes(substr($search, 1, -1)), '%'), ENT_HTML5) . '%\'' . ' OR a.fulltext LIKE ' . html_entity_decode($search, ENT_HTML5) . ')');

avatar kofaysi
kofaysi - comment - 6 Feb 2019

Do you think it is worth a PR?
I do not understand how I could get half of the SQL DB in HTML and half of it UTF-8. But there might be users they are not aware of this at all, their search just does not work sometimes as expected. On the other hand, the suggested solution is only a part of the problem: What about non-content searches, i.e. searches within Titles, Authors, Notes? That query should be corrected, too, right?

avatar HLeithner
HLeithner - comment - 6 Feb 2019

I think it makes no sense to try to convert the search query in every possible encoding supported on this planet. In my opinion its a problem with your data and maybe a used RCE Editor that converted a part of the content to a wrong encoding.

It's better you fix the your database instead of trying to push a workaround in to the core.

You write about tiles and authors but have thease fields the same problem?

avatar kofaysi
kofaysi - comment - 10 Feb 2019

I understand. Thank you for fixing my search. I'll check my older articles.

avatar kofaysi kofaysi - change - 10 Feb 2019
Status New Closed
Closed_Date 0000-00-00 00:00:00 2019-02-10 19:45:53
Closed_By kofaysi
avatar kofaysi kofaysi - close - 10 Feb 2019

Add a Comment

Login with GitHub to post a comment