Joomla! Issue Tracker | Joomla! CMS #9361 - J3.4.8 > 3.5.0-rc: Database Fix Message "The Joomla! Core database tables have not been converted yet to UTF-8 Multibyte (utf8mb4)."

Closed
12 Mar 2016
Medium
Build: master
# 9361

franz-wohlkoenig
10 Mar 2016

Steps to reproduce the issue

Update Joomla!3.4.8 to 3.5.0-rc

Expected result

No Error shown after "Fix Database"

Actual result

Messages

"Error: Duplicate entry 'ausschliesslich' for key 'idx_term' SQL=ALTER TABLE #__finder_terms CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"
"Warning: Database is not up to date!"
Database Problem: "The Joomla! Core database tables have not been converted yet to UTF-8 Multibyte (utf8mb4)."

System information (as much as possible)

PHP 5.6.13
MySQLi 5.5.47-0

Additional comments

Tried multiple Times "Fix"-Button, this Messages stay.

franz-wohlkoenig - open - 10 Mar 2016

andrepereiradasilva - comment - 10 Mar 2016

@richard67 can you check this one?

richard67 - comment - 10 Mar 2016

@wilsonge We had this before, too, didn't we? I am not sure but I think I remember we did nothing on this because clearing search index helped. Can you check?

wilsonge - comment - 10 Mar 2016

Yarp. But i really wanna take a look at this. Is there any way we can get access to your data @franz-wohlkoenig (at least the rows in mysql causing this issue) because we had this reported once before but unfortunately couldn't replicate it and the guy cleared his smart search history and didn't have a backup to give us :/

franz-wohlkoenig - comment - 10 Mar 2016

@wilsonge Like access? It's a Test-Site, so you can work as you wan't.

wilsonge - comment - 10 Mar 2016

That would be perfect :) Can you email me some credentials at george.wilson . at . community.joomla.org please?

franz-wohlkoenig - comment - 10 Mar 2016

3 E-Mails done.

wilsonge - comment - 10 Mar 2016

Thank you! Received them all. As soon as I'm back home I'll take a look and try and figure out what's going on :)

wilsonge - change - 11 Mar 2016

Labels

Added: ?

andrepereiradasilva - comment - 11 Mar 2016

@wilsonge this is the same as #9249 right?

richard67 - comment - 11 Mar 2016

@andrepereiradasilva I am not George, but I just have a bit time and know about it, too, ans so I answer: Yes, it is the same issue as #9249.

andrepereiradasilva - comment - 11 Mar 2016

ok, so a maintainer can close the other one and reference to this.

hmayer1980 - comment - 12 Mar 2016

I just copied my site and did a test upgrade and got the same issue.
What I found curious is, my word raising the error was also a german word which is usually written with a sharp s (https://en.wikipedia.org/wiki/%C3%9F).

richard67 - comment - 12 Mar 2016

Lowercase sharp s somewhere in the middle? Or uppercase? Uppercase sharp does not exists, it is a double s in uppercase.

We changed the sorting of collations from "_general_ci" so "unicode_ci" with 3.5.0 Beta something.

While general sorting is not correct at this point, handling "ß", unicode sorting is. Maybe it is related to this?

For lowercase the "ß" should normally not be converted to "ss", only for uppercase, e.g. when doing an "UPPER('Stoß')" you should get 'STOSS' with unicode collations.

But for sorting, also the lowercase "ß" has to be treated somehoe special (but of course not changed to "ss").

@franz-wohlkoenig In your database before update, is "ausschließlich" or "ausschliesslich"? Can you check that? And after conversion, when it stops with the error, how does it look then with sql? @hmayer1980 or can you check that, Hannes, if the "ß" is changed to "ss" somewhere in the conversion process?

@wilsonge Maybe I should provide a test container where we change it back from "unicode_ci" to "general_ci", so the guys can test if this then works for them? And what do we do if so, if that works? Fall back to general? Since we cannot force the connection collation to use unicode sorting, it might be the right way even. Or try to make it work?

richard67 - comment - 12 Mar 2016

@franz-wohlkoenig Can it be you have both keywords in your search index, "ausschließlich" and "ausschliesslich"? In this case, the conversion from general to unicode collation could make this in fact be a duplicate entry, because lowecase "ss" is an alternative spelling for "ß".

Please check and report back if you have both keywords before updating Joomla!, or if only 1 of them.

@wilsonge If my assumption is true that in case if both kinds of spelling for the sam word then make it be detected as duplicate with unicode collation, then the only reasonable way to solve this would be to fall back from unicode_ci to general_ci for this table at least (but this would be a shit)

Another way could be to detect and delete those future (asfter conversion) duplicates in advance (which are not duplicates before conversion), but this seems not to be practicable to me in an automized way.

Or we let Joomla! allow duplicate entries (but I see no unique key there so maybe the error comes from php, where is checked if duplicate or not?), but this would be a mess for a search where you need unique keywords.

So what shall we do if it turns out that my assumption is right?

_{This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/9361.}

richard67 - comment - 12 Mar 2016

@wilsonge The clean way would be of course to leave it with unicode_ci sorting and advice people to clean their search index before updating, and then rebuild it after updating, if we cannot or shall not automize that.

But I am now almost 100% sure that this is the problem, that with unicode_ci, "ausschließlich" and "ausschliesslich" are considered to be the same, while with general_ci they are not.

Is not a problem with German language only, you have similar character phenomenons in Russian, Greece and others, where uppercase character is different to lowecase but regarding sorting they are treated in the same way.

I guess mysql is using the ICU library http://site.icu-project.org/ for handling these unicode sorting and collations and case change things, and if so, it is like I assume.

richard67 - comment - 12 Mar 2016

See also http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html, where differences between "_general_ci" and "_unicode_ci" aee described with the German example for letter "ß":

To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 10.1.7.8, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
A difference between the collations is that this is true for utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci, which supports the German DIN-1 ordering (also known as dictionary order):
ß = ss

So this is definitely the problem here.

I can provide a PR to fall back to general_ci either completely or only for this particular table and column, but I would like to have a decision if this is the way to go, @wilsonge .

Another way could be to do following before conversion: Loop through the keywords and try to store them in a temporary, unicode_ci sorted table, and if exception for duplicate keys in the new temporary table caught, then delete the corresponding record from the original table, so only records remain before conversion which would be unique after conversion, and future duplicates have been deleted before conversion.

But this would take a bit time, so maybe we better go for 3.5 the general_ci way and handle the change to unicode_ci later with 3.5.1. or 3.6?

_{This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/9361.}

richard67 - comment - 12 Mar 2016

Or maybe we change the index on finder_terms term column from unique to non unique, then convert, then remove duplicates, then reestablish unique index. Problem is I do not know if there are foreign keys on this column, and if maybe other search components could have same problem, too.

_{This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/9361.}

franz-wohlkoenig - comment - 12 Mar 2016

@richard67: There are both keywords ("ausschliesslich" and "ausschließlich") in #_finder_terms. After Conversion to 3.5.0-rc both keywords stay the same.

richard67 - comment - 12 Mar 2016

@franz-wohlkoenig Yeah, see my comments above.

That's the problem, thanks for confirmation. Is the difference between incorrect handling with utf8(mb4)_general_ci collations and correct handling with utf8(mb4)unicode_ci collations.

If you have any ideas or preferences among the possible solutions I provide in my previous 4 comments, let me know please.

franz-wohlkoenig - comment - 12 Mar 2016

"Clean search index before updating and rebuild it after" is in this Situation (kick yellow Blocker away) an easy Solution.

richard67 - comment - 12 Mar 2016

@wilsonge and all who could know that:

Does anybody know if there is a php function which can be called to clear search index?

If there is such, then I could add to the utf8(mb4) conversion routines a step which checks if there are duplicated and only clears the search index if so, and queues a message on that so the user knows he has to rebuild index after update. This would happen only with mysql databases and only once when conversion is performed.

richard67 - comment - 12 Mar 2016

@wilsonge and all who might have wanted to help:

I have found it:

FinderModelIndex::purge();

Is defined in administrator/components/com_finder/models/index.php.

We can call that when doing utf8(mb4) conversion for mysql either anyway or when a query shows we have duplicates.

And we can issue an info message on this.

Would have to be done in script.php and the schema manager Fix, where we run the 2 conversion sql scripts.

richard67 - comment - 12 Mar 2016

But we have language freeze already ... so problem with a new message :-(

chrisdavenport - comment - 12 Mar 2016

Be careful. The purge() method will destroy any static filters you have defined. This is because when you rebuild the index all the taxonomy ids will have changed. You can see how to (usually) avoid the problem by looking at the --purge option in /cli/finder_indexer.php.

However, re-indexing can also take several minutes or more to run (for a moderately large site it can take hours). So I think it would be better to avoid trying to do this automatically. As long as the error does not terminate the process, then I think it would be better to show the user a message saying something like "There was an error in the Smart Search index that will require a complete purge and re-index of the site" and leave it for the user to perform the action. On small sites that can be done in Components -> Smart Search, but on many sites the CLI method will be preferred.

richard67 - comment - 12 Mar 2016

@chrisdavenport Ok, but then we have the problem that beause having a unique key on the term column of table finder_terms, we have an SQL error at this point, and the unique key will not be created.

Does the finder need them to be unique? If not, we can maybe just make a normal key and when someone rebuilds the index for any other reason then they will be unique again (but not reruiring a unique key in database schema)?

brianteeman - comment - 12 Mar 2016

Remember that this will only effect sites using finder in a non-english
language - certainly not a high % of total sites

On 12 March 2016 at 15:39, Richard Fath notifications@github.com wrote:

@chrisdavenport https://github.com/chrisdavenport Ok, but then we have
the problem that beause having a unique key on the term column of table
finder_terms, we have an SQL error at this point, and the unique key will
not be created.

Does the finder need them to be unique? If not, we can maybe just make a
normal key and when someone rebuilds the index for any other reason then
they will be unique again (but not reruiring a unique key in database
schema)?

—
Reply to this email directly or view it on GitHub
#9361 (comment).

Brian Teeman
Co-founder Joomla! and OpenSourceMatters Inc.
http://brian.teeman.net/

richard67 - comment - 12 Mar 2016

Easy solution could be either to revert the change from _unicode_ci collations to _general_ci collations in general for all core tables, this would mean we are on the safe side regarding this problem in general but not benefit from correctness of unicode sorting, or we revert it only for the complete table #__finder_terms.

We cannot do it easily for just the 1 column term of this table, because utf8(mb4) conversion runs per table and not per column (which also would be possible but a mess to maintain in the sql scripts).

Because it also is not easy to force connection collation to use unicode sorting, even if available, we will maybe not benefit anyway from the unicode sorting, and so the complete fallback for all core tables would be the best?

chrisdavenport - comment - 12 Mar 2016

I think the index on the terms table needs to be unique.

What will happen if you get a collision like this is that one of the search terms will no longer work, so it might be for example, that searching for "ausschliesslich" will work but "ausschließlich" will not (or it might be the other way around). That's not what I would consider to be a serious issue. I think we should do the database conversion and just report the error to the user if it arises (and as Brian pointed out, this is not likely to affect many sites). Ideally point them to some documentation which explains how to fix it. I'm happy to write that documentation.

richard67 - comment - 12 Mar 2016

I have a solution, I am just making a PR: We will keep the term column of the finder_term table in general collation so it works as in past, different search terms for different spellings of e.g. "ausschließlich" and "ausschliesslich". Just am preparing the PR.

brianteeman - change - 12 Mar 2016

Labels

Added: ?

chrisdavenport - comment - 12 Mar 2016

Wouldn't it be better to have both of those alternative spellings pointing to the same articles? From my tiny knowledge of German I would have thought they would be considered equivalent. Or not?

wilsonge - comment - 12 Mar 2016

In that case maybe - but not all the time - see the comment here Chris https://mathiasbynens.be/notes/mysql-utf8mb4#comment-52 (not advocating for utf8mb4_bin - but it's the problem with the utf8mb4_unicode_ci that's the point)

richard67 - comment - 12 Mar 2016

My new PR for this issue is #9387 . @franz-wohlkoenig @hmayer1980 Please test.

@swiffer Since issue #9249 was from you, can you test this PR here, too?

chrisdavenport - comment - 12 Mar 2016

Hmm. Interesting. So utf8mb4_unicode_ci works for German but not Greek, whereas utf8mb4_general_ci works for Greek but not German. It's Germany vs. Greece (again)! Looks like utf8mb4_general_ci essentially keeps the current behaviour, which is probably the best approach for now.

Going forwards, it might be fruitful to have a method in the language class which will canonicalise words for each language. This can be called when a search term is entered, before doing the database operation. That will ensure that only one of "ß" or "ss" is used in the index even though the database itself could use either. Then the Greeks and the Germans would both be happy!

richard67 - comment - 12 Mar 2016

Good idea for the future .. but have no idea how I could do that. Just wanted to fix bug for both germans and greece, but seems you are right and we will now keep current behavior as best what we can do for all (except the Greece).

And it is not because I am German why I made this PR. Not wanna tease the Greece.

franz-wohlkoenig - comment - 12 Mar 2016

@wilsonge I will restore Test-Site to 3.4.8 for testing PR of @richard67, if you give okay.

@chrisdavenport As Austrian with Language "German" all is said.

chrisdavenport - comment - 12 Mar 2016

I hope I didn't offend anyone. Greeks, Germans, Austrians, I love you all. :-)

I've added search word canonicalisation to my to-do list. It shouldn't be hard to implement, but given my current backlog, don't expect a PR soon. If anyone reading this wants to have a go, drop me a line and I'll point you in the right direction.

wilsonge - comment - 12 Mar 2016

OK I'm going to close this issue in favour of the PR at #9387 - Can we get some tests there please :)

Btw when I tried searching for the term "ausschließlich" or "ausschliesslich" in your site I got an error that the context hasn't been indexed which was kinda weird....

wilsonge - change - 12 Mar 2016

Status	New	⇒	Closed
Closed_Date	0000-00-00 00:00:00	⇒	2016-03-12 18:25:58
Closed_By		⇒	wilsonge

wilsonge - close - 12 Mar 2016

wilsonge - change - 12 Mar 2016

Labels

Removed: ?

wilsonge - reference | 161bc19 - 13 Mar 16

Stanzilla - comment - 22 Mar 2016

I had the same error just with the word hätte and fixed it like this http://forum.joomla.org/viewtopic.php?f=707&p=3379083#p3379211

Add a Comment

Older
Newer

Joomla! Issue Tracker - CMS

[#9361] - J3.4.8 > 3.5.0-rc: Database Fix Message "The Joomla! Core database tables have not been converted yet to UTF-8 Multibyte (utf8mb4)."

Steps to reproduce the issue

Expected result

Actual result

System information (as much as possible)

Additional comments

Add a Comment