content search and searching Special Characters (HTML, UNICODE)
Update search/finder index, if necessary.
Both articles containing the same word (in meaning) are found for both searches (HTML and UNICODE)
Each search displays only one article at a time.
Joomla 3.9.2
I have a bunch of older articles, from around e.g. 2017-03-30 (Joomla was always kept updated to the public release version), in which these entities occur. Articles written some time later do not contain HTML entities; their texts are Unicode (UTF-8) encoded.
By opening and saving such article in the administrator site, the encoding changes, from html to unicode(?).
By opening and closing such article in the administrator site, the encoding changes, from html to unicode(?).
The creation dates of the articles containing HTML entities range from 2015-10-06 to 2017-04-25. I have no clue whether I changed some Joomla configuration afterwards, or whether the JCE Editor or the Joomla core changed at this or a later date.
This is not the case for searching in frontend. A search with unicode characters returns also the phrase with html entities. A search with html entities is converted to unicode (in the search field) and returns also unicode entities.
Search for both versions of the chosen words in the administrator site, e.g. content:stáhnout and content:st&aacute;hnout.
I assume you are referring to using the search above the list of content items? The ability to search inside content was only recently added. It does not use finder. It is just a basic mysql search.
@kofaysi
Can you try this?
Modify line 341 of file /administrator/components/com_content/models/articles.php
from
$query->where('(a.introtext LIKE ' . $search . ' OR a.fulltext LIKE ' . $search . ')');
to
$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . htmlentities($search) . ' OR a.introtext LIKE ' . html_entity_decode($search) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . htmlentities($search) . ' OR a.fulltext LIKE ' . html_entity_decode($search) . ')');
Not sure it is the correct way to solve the issue, but it seems to work here.
I changed lines and copied the file back and tried the search. The behavior did not change. The search still finds only one of the two articles in question. I also logged out and logged in. No changes. Should I do anything more?
The line suggested, n. 341, does influence the search, but using only
$query->where('(a.introtext LIKE ' . html_entity_decode($search) . ')');
gives exactly the same results as using the suggested complex new line by @infograf768. In this simple case, I would expect that unicode encoded words wouldn't yield any result, because they are going to be html encoded. But they do find the correct unicode encoded article.
Performing these searches on the SQL server directly gives an interesting finding. Searching for "maďarčina" gives only a single result from the two articles, the unicode/utf8 one. But the query contains (among others)
CONVERT(introtext USING utf8) LIKE '%maďarčina%'
That is: the introtext should be converted from HTML entities to utf8 and only after this conversion searched or checked. The article whose introtext contains the HTML-entity form of "maďarčina" has not been found in this case.
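A note on the CONVERT(... USING utf8) observation: as far as I know, that MySQL function only changes the character set of the value and does not decode HTML entities, so an article stored with entities stays plain ASCII text. The PHP sketch below illustrates the mismatch. The numeric entity spelling of the stored word is an assumption made for illustration, since the thread does not show how the word is actually stored.

```php
<?php
// Assumed stored form of the word in an older article (numeric entities, for illustration only).
$stored = 'ma&#271;ar&#269;ina';
// The UTF-8 search term as typed by the user.
$needle = 'maďarčina';

// A plain substring test fails: the entity text consists of different characters.
$foundRaw = strpos($stored, $needle) !== false;

// After decoding the entities to UTF-8, the same test succeeds.
$decoded = html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$foundDecoded = strpos($decoded, $needle) !== false;

var_dump($foundRaw, $foundDecoded);
```

This would explain why only the article stored as UTF-8 is matched: the comparison happens on the stored characters, not on what a browser would render.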
I'm reporting about searching for various words and their parts. Some searches were successful, some of them weren't. It seems like some of the characters are converted, some of them can't be converted, UTF8 <-> HTML.
I tried to modify the code using htmlentities($search, 'UTF-8') and htmlentities($search, "UTF-8"), with no luck: the code was not accepted by Joomla.
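A likely reason the call was rejected: in PHP the second parameter of htmlentities() is an integer flags bitmask, and the encoding is the third parameter. A minimal sketch of a call with the parameters in that order follows; the variable names are illustrative only and the term is passed without quotes or wildcards.

```php
<?php
// Signature: htmlentities(string $string, int $flags, ?string $encoding, bool $double_encode)
$term = 'stáhnout';

// Flags come second, the encoding third.
$encoded = htmlentities($term, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// html_entity_decode() takes its parameters in the same order.
$roundTrip = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $encoded, PHP_EOL;   // st&aacute;hnout
echo $roundTrip, PHP_EOL; // stáhnout
```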
But html_entity_decode() does something. This search is successful with the code by @infograf768:
I modified the code to
$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . htmlentities($search, ENT_HTML5, "utf-8") . ' OR a.introtext LIKE ' . html_entity_decode($search, ENT_HTML5) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . htmlentities($search, ENT_HTML5, "utf-8") . ' OR a.fulltext LIKE ' . html_entity_decode($search, ENT_HTML5) . ')');
and using html_entity_decode($search, ENT_HTML5) works, see below for the whole Czech alphabet. I cannot get the parameters for htmlentities() right, though.
aábcčdďeéěfghchiíjklmnňoópqrřsštťuúůvwxyýzžAÁBCČDĎEÉĚFGHChIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ
aábcčdďeéěfghchiíjklmnňoópqrřsštťuúůvwxyýzžAÁBCČDĎEÉĚFGHChIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ
content:aábcčdďeéěfghchiíjklmnňoópqrřsštťuúůvwxyýzžAÁBCČDĎEÉĚFGHChIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ
content:aábcčdďeéěfghchiíjklmnňoópqrřsštťuúůvwxyýzžAÁBCČDĎEÉĚFGHChIÍJKLMNŇOÓPQRŘSŠTŤUÚŮVWXYÝZŽ
This is really strange. Using htmlentities($search, ENT_HTML5), all special characters, e.g. á, č, ď, are converted to their ASCII variants, e.g. a, c, d. I found this out by searching for áčď and finding articles with acd.
So the closest I can get is using
$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . htmlentities($search) . ' OR a.introtext LIKE ' . html_entity_decode($search, ENT_HTML5) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . htmlentities($search) . ' OR a.fulltext LIKE ' . html_entity_decode($search, ENT_HTML5) . ')');
which is fully compliant with the á, é, í, ó and ú characters, but not with other special characters.
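One possible explanation for why only á, é, í, ó and ú work: htmlentities($search) with its default flags uses the HTML 4.01 entity table, which has names for the Latin-1 accented letters but none for č, ď or ř. The sketch below compares the two tables; it assumes a bare UTF-8 term without the surrounding quotes and wildcards.

```php
<?php
// Compare how the HTML 4.01 and HTML5 entity tables treat some Czech letters.
$chars = ['á', 'é', 'č', 'ď', 'ř'];

foreach ($chars as $char) {
    $html401 = htmlentities($char, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    $html5   = htmlentities($char, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Letters without an HTML 4.01 name come back unchanged in the first column.
    echo $char, '  HTML401: ', $html401, '  HTML5: ', $html5, PHP_EOL;
}
```

If the older articles store č as a named or numeric entity, a pattern built from the HTML 4.01 table can never match it, because the letter is left as is.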
Yeah, it looks like nothing works for some htmlentities.
The only viable solution I see for you is to edit and save again the articles containing htmlentities in db.
Boring but necessary.
BBEdit would do that fast for you on a dump of the _content table, but beware, as some of the htmlentities you are using may not be compatible. For example, some entity spellings of č would not be decoded, while others will be decoded correctly to č.
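To illustrate the point about incompatible entity forms in PHP terms: the three spellings of č below are assumptions chosen for illustration (the thread does not preserve the exact forms). The sketch also includes a hypothetical helper, decodeTextEntities(), showing one way a dump could be cleaned while leaving markup-significant entities such as &amp;amp; and &amp;lt; untouched. It assumes PHP 7 or later and should be tried on a copy of the data first.

```php
<?php
// Three assumed spellings of the same letter: HTML5 name, decimal, hexadecimal.
$forms = ['&ccaron;', '&#269;', '&#x10D;'];

foreach ($forms as $form) {
    $html401 = html_entity_decode($form, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    $html5   = html_entity_decode($form, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    echo $form, '  HTML401: ', $html401, '  HTML5: ', $html5, PHP_EOL;
}

/**
 * Hypothetical helper: decode entities in article HTML, but keep those
 * whose decoded form would change the markup (&, <, >, quotes) or is a
 * non-breaking space.
 */
function decodeTextEntities(string $html): string
{
    $protected = ['&', '<', '>', '"', "'", "\u{00A0}"];

    $result = preg_replace_callback(
        '/&(?:#\d+|#x[0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/',
        static function (array $match) use ($protected): string {
            $decoded = html_entity_decode($match[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');

            // Leave markup-significant entities exactly as they were stored.
            return in_array($decoded, $protected, true) ? $match[0] : $decoded;
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo decodeTextEntities('<p>ma&#271;ar&ccaron;ina &amp; &lt;b&gt;</p>'), PHP_EOL;
```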
OK, I understand. Can you suggest any place, where I could report this unexpected behavior? (There is a PHP sandbox on the net, where this conversion works perfectly.)
On line 340, two percent symbols % are added to the search string, one on the left and one on the right. Those two symbols are converted to \% by htmlentities($search, ENT_HTML5). Example: the query \%foo\% is searched in the SQL DB instead of %foo% when searching content:foo.
I came up with this brute-force solution then:
$query->where('(a.introtext LIKE ' . $search . ' OR a.introtext LIKE ' . '\'%' . htmlentities(trim(stripslashes(substr($search, 1, -1)), '%'), ENT_HTML5) . '%\'' . ' OR a.introtext LIKE ' . html_entity_decode($search, ENT_HTML5) . ' OR a.fulltext LIKE ' . $search . ' OR a.fulltext LIKE ' . '\'%' . htmlentities(trim(stripslashes(substr($search, 1, -1)), '%'), ENT_HTML5) . '%\'' . ' OR a.fulltext LIKE ' . html_entity_decode($search, ENT_HTML5) . ')');
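The same idea can be sketched without stripping quotes and wildcards back off the string: build the variants from the raw search term first, and add the % wildcards afterwards. buildLikePatterns() is a hypothetical helper, not part of Joomla, and each returned pattern would still need to go through the database driver's quoting before being placed in the query. Encoding before adding wildcards also matters because the HTML5 entity table may have names for some punctuation characters as well.

```php
<?php
/**
 * Hypothetical helper: return LIKE patterns for the raw term, its
 * entity-encoded form and its entity-decoded form, without duplicates.
 */
function buildLikePatterns(string $term): array
{
    $flags = ENT_QUOTES | ENT_HTML5;

    $variants = [
        $term,
        htmlentities($term, $flags, 'UTF-8'),
        html_entity_decode($term, $flags, 'UTF-8'),
    ];

    $patterns = [];

    foreach (array_unique($variants) as $variant) {
        // Escape LIKE metacharacters in the term, then add the wildcards.
        $escaped    = strtr($variant, ['\\' => '\\\\', '%' => '\\%', '_' => '\\_']);
        $patterns[] = '%' . $escaped . '%';
    }

    return $patterns;
}

print_r(buildLikePatterns('stáhnout'));
```

For the term stáhnout this yields two patterns, one for the UTF-8 spelling and one for the entity spelling, which could then be joined with OR for both introtext and fulltext.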
Do you think it is worth a PR?
I do not understand how I could get half of the SQL DB in HTML and half of it in UTF-8. But there might be users who are not aware of this at all; their search just does not work as expected sometimes. On the other hand, the suggested solution covers only a part of the problem: what about non-content searches, i.e. searches within Titles, Authors, Notes? That query should be corrected, too, right?
I think it makes no sense to try to convert the search query into every possible encoding supported on this planet. In my opinion it's a problem with your data, and maybe with the JCE Editor you used, which converted a part of the content to a wrong encoding.
It's better to fix your database instead of trying to push a workaround into the core.
You write about titles and authors, but do these fields have the same problem?
I understand. Thank you for fixing my search. I'll check my older articles.
Status | New | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2019-02-10 19:45:53 |
Closed_By | ⇒ | kofaysi |
Why do you use entities? Is there any reason to use them?