?
avatar joeforjoomla
joeforjoomla
14 Mar 2016

Steps to reproduce the issue

All third party extensions using MySQL JOIN with core tables such as #__sessions, #__menu, etc on primary key fields won't work anymore because the collation change from utf8_general_ci to utf8_unicode_ci

Use a query like this one:

SELECT * FROM #__mycustomtable AS s LEFT JOIN #__menu AS m ON s.menutype= m.menutype

SELECT * FROM #__mycustomtable AS s LEFT JOIN #__session AS m ON s.session_id= m.session_id

Expected result

Instead of having a successful query based on the JOINED 'menutype' field it will fail because the table used by the third party extension '#__mycustomtable' is still based on utf8_general_ci

Actual result

Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8_unicode_ci,IMPLICIT) for operation '=' SQL=SELECT ....

System information (as much as possible)

Additional comments

Developers should be at least warned that all their extensions based on such queries will be broken and have time to update all tables of their extensions accordingly. At this point March 16 as the release date of Joomla 3.5 is not acceptable, you are going to break a lot of third party extensions in a subtle way. 3 of my extensions were broke because of this issue.
@spignataro @dunglas @beat

avatar joomlajoe joomlajoe - open - 14 Mar 2016
avatar joomlajoe joomlajoe - change - 14 Mar 2016
Title
B/C break for all third party extensions due to MySQL collation change
B/C break for all third party extensions due to MySQL collation change!!!
avatar Bakual
Bakual - comment - 14 Mar 2016

I'm a bit confused because the collation should be utf8mb4_unicode_ci and not utf8_unicode_ci as you stated.
I don't have that issue with my extension. I do have joins over the #__users table. I guess it only becomes an issue if we face varchar/text fields?

avatar spignataro
spignataro - comment - 14 Mar 2016

Can you tell me which extension / statement location you are having issues with?

avatar joomlajoe
joomlajoe - comment - 14 Mar 2016

All extensions using an SQL query like these one are affected. Core Joomla tables are currently changed from utf8_general_ci to utf8_unicode_ci, see screenshot
collation_utf8

avatar joomlajoe joomlajoe - change - 14 Mar 2016
Status New Closed
Closed_Date 0000-00-00 00:00:00 2016-03-14 20:48:09
Closed_By joomlajoe
avatar joomlajoe joomlajoe - close - 14 Mar 2016
avatar joomlajoe joomlajoe - close - 14 Mar 2016
avatar joomlajoe joomlajoe - change - 14 Mar 2016
Status Closed New
Closed_Date 2016-03-14 20:48:09
Closed_By joomlajoe
avatar joomlajoe joomlajoe - reopen - 14 Mar 2016
avatar joomlajoe joomlajoe - reopen - 14 Mar 2016
avatar joomlajoe
joomlajoe - comment - 14 Mar 2016

NOTE that the issue happens only when the JOIN involves fields with different collation:
SELECT * FROM #__mycustomtable AS s LEFT JOIN #__menu AS m ON s.menutype= m.menutype

s.menutype -> utf8_general_ci
m.menutype -> utf8_unicode_ci or whatever you want utf8mb4_unicode_ci

Results: Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8_unicode_ci,IMPLICIT)

http://stackoverflow.com/questions/9197300/illegal-mix-of-collations-utf8-general-ci-implicit-and-utf8-unicode-ci-implic

avatar richard67
richard67 - comment - 14 Mar 2016

@mbabker Depends, if he has mysql server lower than 5.5.3 or mysqlnd cient lower than 5.0.9 (as far as I remember) or other mysql client lower than 5.5.3 or maybe problem with MariaDB and version number detection on Windows, then utf8 is correct, not utf8mb4. But the collations we also change from _general_ci to _unicode_ci for these, so the formerly called utf8mb4 conversion is meanwhile a utf8(mb4) conversion.

avatar wilsonge
wilsonge - comment - 14 Mar 2016

At this point March 16 as the release date of Joomla 3.5 is not acceptable

Just to say that this code has been present since beta 1 4 MONTHS AGO (and in the 3.5 branch for nearly 9 months....). I'm sorry - if someone comes up with a fix then I'll be more than happy to work it in - but as far as i'm concerned this isn't a release blocker....

avatar joomlajoe
joomlajoe - comment - 14 Mar 2016

Sorry George but i can't agree with you. As long as a core database change breaks backward compatibility with third extensions and *this is not officially well documented * since the beta 1 and months ago as it should have been, i must consider it a bad managed approach, i must consider a bad Joomla project management. So you are going to release on Wednesday a Joomla 3.5 release that is going to break an unpredictable number of extensions.
Sadly i have to deal with my customers sending me dozens of emails if extensions stop working, and this doesn't like to nobody, nor to me nor to final users.

avatar JoomliC
JoomliC - comment - 14 Mar 2016

If this could help, i've just got the same error, but only with RC2!
I'm testing my extension since beta1, and no such issue before. Seems related only to a recent change for RC2...


This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/9423.

avatar mbabker
mbabker - comment - 14 Mar 2016

The way I see it the options are to basically press on with things as they are or roll back every database layer change made for 3.5. Because frankly something in that conversion is going to screw someone. Be it the fact that the architecture changes have caused the Extension Manager to not be an upgrade path any longer, this collation change (which frankly I've been getting reports of on my own extensions for years, mixed collations in the database is nothing new and nothing that can be addressed in core unless the tables never change or core takes on the responsibility of managing third party schemas, that'll go over well with developers), or causing someone to no longer be able to move from a MySQL 5.5.3+ to 5.5.2- environment.

Just call it 4.0 and be done with it, then at least users expect everything's gonna break hard. </sarcasm>

avatar wilsonge
wilsonge - comment - 14 Mar 2016

@JoomliC in RC2 we went from general_ci to unicode_ci. The thing is if we revert back to general_ci then everyone currently on unicode_ci hits the same bug. Or either way people on latin or swedish get screwed....... So you not getting the error was just down to your setup and not necessarily that change fixing the problem for all users...

avatar mbabker
mbabker - comment - 14 Mar 2016

It's screwed anyway. What happens with utf8_general_ci and utf8mb4_general_ci?

avatar JoomliC
JoomliC - comment - 14 Mar 2016

To add more, the only query join that returns the error for me, and ONLY in RC2 :

        // Join over the language
        $query->select('l.title AS language_title')
            ->join('LEFT', $db->quoteName('#__languages') . ' AS l ON l.lang_code = e.language');


This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/9423.

avatar joomlajoe
joomlajoe - comment - 14 Mar 2016

I don't say to not upgrade from utf8_general_ci to utf8_unicode_ci, this can be done BUT this must be well documented so that developers will be aware of this change months ago! Developers must have time to update their database tables accordingly to changes of core tables. The Joomla core can't know which tables need an update in third party extensions, so developers have to test and do this. Joomla is not your project, it's a project of a community and people that built business around it, think twice before change things.

avatar JoomliC
JoomliC - comment - 14 Mar 2016

@wilsonge i understand... the problem is that you have the same issue with com_weblinks ( :-( )


This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/9423.

avatar mbabker
mbabker - comment - 14 Mar 2016

Joe by your arguments then the only fix is as I said before, revert all the database layer changes because core cannot guarantee that a database schema change won't break extensions. I'll save all of us the work, just revert back to 3.4.8 and halt development because a bug fix might break some extensions. You see how crazy these arguments can get?

The utf8mb4 character set changes have been well documented for months. Yes, the collation change is more recent but one that didn't just magically appear out of nowhere.

Joomla is not your project, it's a project of a community and people that built business around it, think twice before change things.

Ya know, I've been trying to make Joomla "not my project" for a year and a half. Let me try to count how many times I've publicly asked people to pick up my tasks, told people bluntly it needs to happen, and that I can't be relied on to do everything. I'm only here because contrary to my public image, I'm not a complete ass and won't just take my ball and leave and hope someone figures out how to do half the stuff I've been doing. So no, I'm not here because I enjoy it anymore; I'm here because if I leave more of this project will go down in flames than any one person realizes. You want me gone, take over my responsibilities and I'll go find a new life because Joomla having consumed mine isn't fun for me anymore.

avatar joomlajoe
joomlajoe - comment - 14 Mar 2016

I agree with you Michael, fully. Of course my words are not meant to offend nobody, even if i have hard words. You are a great person, i know what you did for Joomla and i respect you 100%. You are a great guy i know, and you did even too much for Joomla. But that's not the point, the point is that Joomla 3.5 must not introduce B/C breaks, that's all. Everything else i don't care, all your great words are unuseful.

avatar beat
beat - comment - 15 Mar 2016

I haven't seen any B/C issues in Community Builder yet as we join usually on primary keys which are always integers. However as this B/C issue seems confirmed, it is imho a major issue.

Just a few stupid questions to try to address this issue constructively:

  1. Do we still need 4-bytes UTF-8 in database with proper filtering implemented at PHP level for older MySQL versions ?

If yes:
2. Would using general_ci fix the issue (not sure about that one) ?

If no: Maybe we can postpone the collation change to upcoming Joomla 4 while leaving all the other great stuff in 3.5 in a RC4 (do not worry about B/C towards RC2/3) ?

avatar wilsonge
wilsonge - comment - 15 Mar 2016
  1. Yes - we definitely do - there's still a decent number of users running MySQL 5.1
  2. No - general_ci would fix users currently on general_ci but break users on unicode_ci or latin/swedish collations.
  3. You can't have utf8mb4 support and not have collation change....
avatar mahagr
mahagr - comment - 15 Mar 2016

Just linked similar discussion from Kunena into this topic.

If Joomla doesn't update all 3rd party extension tables during 3.5 upgrade, many sites will break (on bad joins), regardless if component has added support for new collation. If there is automatic collation change, well, then everything should generally be fine after upgrade.

Another issue is database table generation when installing components -- collation of all new tables should be the same as in Joomla.

The issue is that I don't think that automatic collation change really works as some extensions may use binary/custom collation in some of the tables and changing collation in those may result to unexpected behavior.

Just my 2 cents. I've not really looked into the implementation, so do not think that these are real issues. They are just my concerns that came into my mind when Kunena added the support.

avatar richard67
richard67 - comment - 15 Mar 2016

Was a bit involved in doing the utf8mb4 stuff for the core, so I was curious on how Kunena Forum did with the commit linked above. Looks good to me for new Kunena installations.

Do they also have scripts or other migration strategies to change existing installations (index resizing, maybe columns resizing for not losing data, then "alter table ... convert to character set ... collate ..." and so on)?

If not, would it be helpful for them to save information about conversion status in database? We could extend the new table #__utf8_conversion in core by a column for the extension ID (would also be PK, and for the core the hardecode core extension ID), so extensions which have to handle migration/conversion can use it to keep track of the conversion status (0 = not done yet, 1 = converted to utf8_unicode_ci or for a few exceptions still utf8_general_ci, and 2 = converted to utf8mb4_unicode_ci or for a few exceptions utf8mb4_general_ci).

By comparing the desired conversion status according to actual server and client's capability with the status saved in this table, it can be detected if a conversion is necessary, e.g. after migrating the database from an older to a newer mysql server version.

Would that help extension developers? If so, I could prepare a PR for the core to allow that (and adapt the conversion status checks of the core to check only for core extension id).

Feedback is welcome.

avatar spignataro
spignataro - comment - 15 Mar 2016

My concern here is like anyone else. The last thing you want to do is massively break peoples sites. The first thing that many people do is update joomla with components last. And component developers may not be able to get a release out in time either due to back log or just learning about the issue.

Since this is a B/C issue and the only one that I can find after reviewing the list of changes I would suggest to hold this change off till 4.x release. The reason for this is that you don't want millions of Joomla! sites to stop working and then you are overloading the 3rd party developers with support tickets to remedy the issue. Last thing I want is 50,000 support tickets to show in a day. Look I have no bandwidth to handle that.

So my vote and my 2 cents is to roll back this one change so that you don't effect sites or render sites useless.

avatar richard67
richard67 - comment - 15 Mar 2016

Problem is that an automatic collation change without losing any data or violating uniqe keys is not always possible. Furthermore, to get collation information for tables it requres access to the information_schema (or however is called), and this is not granted on every hosted database server, as far as I heard.

avatar spignataro
spignataro - comment - 15 Mar 2016

@richard67 - you make a valid point as some hosting companies restrict what you can and can't do in mysql.

avatar richard67
richard67 - comment - 15 Mar 2016

@spignataro Yes, and that's also the reason why Joomla! does not and never should change the default database collation with a new installation or with this new utf8(mb4) conversion (except of new installation with not existing database and database creating privilges for specified user on specified server, so a new database will be created by the installation, which is mainly the case on local testing environments).

avatar beat
beat - comment - 15 Mar 2016

@richard67 @spignataro In Community Builder, since almost 2 years, we do use SHOW COLUMNS and SHOW INDEX to check/fix database structures on every installation and upgrade. I just checked with our support: We didn't have any complaints from users for hosting restrictions on SHOW SQL statements so far. Same for changes of collations. Joomla's database uses that too on most queries. So don't think that this is an issue.

avatar richard67
richard67 - comment - 15 Mar 2016

@beat @spignataro Yes, for columns you can use "SHOW FULL COLUMNS" for getting the collation of the column (which also includes the character set).

The core uses it without the "FULL" up to now for its database schema check, because these not check the collation as the core does not use regular schema updates to do the conversion, but I do not think there are limitations for that.

But for default collation of complete tables I haven't found another solution that a query to information_schema stuff, because "SHOW FULL TABLES" or so does not work or not contains the collation info, don't remember now how it was when I tried this. And these queries I was told are not supported everywhere.

avatar beat
beat - comment - 15 Mar 2016

@wilsonge I don't think you understood my question, which was maybe not well formed.

Do we still need 4-bytes UTF-8 in database with proper filtering implemented at PHP level for older MySQL versions ?

I meant: Now that we have the proper UTF-8 4-bytes filtering implemented for MySQL 5.1 in Joomla 3.5 (and that part must sure be kept): Do we still have an urgency or any other important reason for also migrating the SQL tables to UTF-8mb4 (in Joomla 3.5 - before Joomla 4) ?

It is a welcome progress, and should sure be kept for Joomla 4 (or maybe even for 3.6 with proper resolution of breaking sites and enough in advance pre-announcement to extensions-developer e.g. through the JED Newsletter), question is only benefit/hassle ratio for Joomla 3.5.

avatar richard67
richard67 - comment - 15 Mar 2016

@beat P.S.: Of course a way around is to use "SHOW FULL COLUMNS" to check a particular column of each table and so see it needs an "alter table xxx conver to character set yyy collate yyy_zzz", but then you have to know which column you can check, and not every tanle has text columns, some have numeric only, but also for these the default collation has to be set/changed.

avatar mahagr
mahagr - comment - 15 Mar 2016

@richard67 Kunena generally avoids of using joins on text fields, but it does have its own update logic which is pretty complicated. Because of that there were a few pull requests for fixing the issue. I don't really know all of the details as I'm no longer directly involved in Kunena (because of the time constrains coming from my personal life and work).

Are you saying that there is no automatic collation conversion on upgrades? That is great as it means that existing sites would not break during an update to J3.5. It does however leave the issue of people installing and upgrading extensions in their sites and those running SQL queries to update DB schema.

I think that extension developers would benefit most from having information about this change and some rather simple ways to handle it.

avatar beat
beat - comment - 15 Mar 2016

@richard67 @spignataro I just re-checked in CB: We already use SHOW FULL COLUMNS and SHOW FULL INDEX since over 2 years for all MySQL 5.x installs. Without any issues reported to our support, so, given the sample number of CB installs, I can say with 99.9% statistical certainty that SHOW FULL is not an issue for current Joomla hostings.

@richard67 Community Builder has also its own updater, and it sets its collations to be same as Joomla's __users table. However as I understand the issue, for other extensions using JOIN on TEXT (which might be less extensions than we think btw) the issue is that both core and their extensions would have to be upgraded collations simultaneously to avoid site breaks. It's not like you could upgrade first the extension then Joomla's collation ?

avatar joomlajoe
joomlajoe - comment - 15 Mar 2016

A quick solution that i adopted to fix my extensions that were broken, is to check the collation of Joomla core tables/fileds with SHOW FULL COLUMNS in the component install script then update the component tables/fields involved in the JOIN with the new collation used by the Joomla core table adopting a query like:
"alter table xxx conver to character set yyy collate yyy_zzz"

Of course this requires that you know where your extension is doing an SQL JOIN with fields having a different collation

avatar wilsonge
wilsonge - comment - 15 Mar 2016

Honestly I don't understand why people would join over the session id or menutype - and i certainly would suspect it's not common.

My major worry was extensions joining over the language table on the alias column (like we do a lot in Joomla core). However over the last 12 hours since this issue came to my attention I've got several people to upgrade for me to 3.5 RC3 on multilang sites and none of them have hit errors yet....

avatar richard67
richard67 - comment - 15 Mar 2016

@beat

it sets its collations to be same as Joomla's __users table

How does CB do that? Is it a "show full columns" for a known varchar or text column of the users table? If not: Can you post the statement to obtain this info here?

@joomlajoe @beat

For joins on text or varchar or so: Good you already avoid that. There are modifications for queries with joins:

  • If only collations are different but charset is same, just append e.g. a " COLLATE utf8mb4_unicode_ci" or whatever you need to the end of the query. But the charset has to be supported, i.e. if utf8mb4 is not supported this example witll fail.
  • If charsets differ, too, you have to convert the columns in the join condition, e.g. "CONVERT(a.col_a USING utf8mb4) = CONVERT(b.col_b USING utf8mb4)" or so.
  • You can combine both if necessary.
avatar wilsonge
wilsonge - comment - 15 Mar 2016

Honestly on the basis that this code was PR'd 9 months ago (to the day), and merged about 8 months ago and in formal beta's since 4 months ago - I'd guess the amount of sites affected by this is relatively limited else we would have had the issue reported to us already. It's far from ideal and I'm the first to admit that. But moving to unicode is going to give a lot of multilang sites a better experience.

In terms of semantic versioning the public API is backwards compatible - so this is fine to go in a minor release imho

avatar beat
beat - comment - 15 Mar 2016

@wilsonge Agree that Menus and sessions have proper APIs in Joomla and I would vote that accessing those tables directly with a JOIN is not really part of Joomla API.

As said above, maybe the issue is affecting only very few extensions. We still need to know which extensions have been confirmed breaking so far (Community Builder certainly isn't broken by Joomla 3.5 RC2 according to our latest B/C tests). So far I only see 3 own ones from @joomlajoe mentionned expressely.

avatar richard67
richard67 - comment - 15 Mar 2016

Did anyone check already if my idea above for using the #__utf8_conversion table to store extensions conversion status could be useful for core or 3rd party extension developers?

Or is it not needed (at least CB seems not to have a need)?

avatar brianteeman
brianteeman - comment - 15 Mar 2016

@joomlajoe it would be useful to know what your extensions are - searching for your name doesnt give any clues.


This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/9423.

avatar mbabker
mbabker - comment - 15 Mar 2016

For anyone even mildly concerned with (or finding an environment with) the collation changes, test this scenario.

Create a database with the tables specified in this gist (just copy/paste the contents, it'll create everything for you). Note that this dump comes off a MySQL 5.6.29 database. Also note each table explicitly uses different collation/charset configurations. With this database, run this query:

SELECT c.*, a.name AS asset_name, l.title AS language_title FROM jos_categories AS c LEFT JOIN jos_assets AS a ON a.id = c.asset_id LEFT JOIN jos_languages AS l ON l.lang_code = c.language;

Report your results and server stats. This is a pretty realistic stub of a query that could be used in the core CMS.

avatar mahagr
mahagr - comment - 15 Mar 2016

I've been thinking about the conversion table, but I cannot see any use for it unless it can be used to list all text fields so that Joomla could do automatic conversion for everything at once. The issue here is that I don't think that majority of the extension developers will use the feature making it impossible to do the conversion safely from a single location.

I also agree that JOINs over text fields may not be that common except maybe for the language table. I guess that for me the major issue with this change is that I still do not fully understand when the collation change gets applied and in what extend.

avatar richard67
richard67 - comment - 15 Mar 2016

@mahagr

can be used to list all text fields

This would not be far away from my favourite solution for a new schema handling in 4.0 or 5.0, that the core has own schema information tables (like repository) where it can keep track of different column and indes lengths for different charsets.

avatar brianteeman
brianteeman - comment - 15 Mar 2016

I guess that for me the major issue with this change is that I still do
not fully understand when the collation change gets applied and in what
extend.

I guess you will have to take the time to read the code. Sorry I'm not
wasting my time on theoretical what ifs

avatar joomlajoe
joomlajoe - comment - 15 Mar 2016

@brianteeman the problem occurs when you have SQL JOINS involving varchar (or text but this is really uncommon so that's not a problem).
If you have a JOIN with the session_id field in the #__session table it will fail, if you have a JOIN with the menutype field in the #__menu table it will fail, if you have a JOIN with the lang_code or sef field in the #__languages table it will fail, etc

The point is not what are extensions affected, nobody can be aware of what extensions are using a JOIN query like these ones. The point is that this is a B/C break, now Joomla creates tables as: ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 DEFAULT COLLATE=utf8mb4_general_ci

Moreover this was not documented and developers were not aware of this possible break, so for me this is a blocker issue that should be postponed to Joomla 4 when all extensions will be updated to create new database tables accordingly.

avatar brianteeman
brianteeman - comment - 15 Mar 2016

Unsubscribed. I dont waste my time with theoreticals

On 15 March 2016 at 14:56, joomlajoe notifications@github.com wrote:

@brianteeman https://github.com/brianteeman the problem occurs when you
have SQL JOINS involving varchar (or text but this is really uncommon so
that's not a problem).
If you have a JOIN with the session_id field in the #__session table it
will fail, if you have a JOIN with the menutype field in the #__menu table
it will fail, if you have a JOIN with the lang_code or sef field in the
#__languages table it will fail, etc

The point is not what are extensions affected, nobody can be aware of what
extensions are using a JOIN query like these ones. The point is that this
is a B/C break, now Joomla creates tables as: ENGINE=InnoDB DEFAULT
CHARSET=utf8mb4 DEFAULT COLLATE=utf8mb4_general_ci

Moreover this was not documented and developers were not aware of this
possible break, so for me this is a blocker issue that should be postponed
to Joomla 4 when all extensions will be updated to create new database
tables accordingly.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#9423 (comment)

Brian Teeman
Co-founder Joomla! and OpenSourceMatters Inc.
http://brian.teeman.net/

avatar mbabker
mbabker - comment - 15 Mar 2016

Joe please test what I posted at #9423 (comment) as that will validate the issue you are reporting or write it off as a non-issue. Even better, give us a stub database dump of the environment you ARE having issues with that can be reproduced in the same manner I have with that comment.

As long as the argument stays at "you changed the collation and charset, this must be reverted", this isn't going anywhere.

avatar spignataro
spignataro - comment - 15 Mar 2016

I have not been able to replicate a issue in any of our components thus far. Or maybe I am not looking in the right area as of yet since there are many views and areas. So as of right now I am not seeing an issue for us.

Is it possible that you can post more details of what you found and why my name was attached to this post to begin with? Because I would like to know what the issue is that you saw with our products.

avatar joomlajoe
joomlajoe - comment - 15 Mar 2016

@mbabker why the following should have issue if all core tables are involved and the collation is the same?
SELECT c.*, a.name AS asset_name, l.title AS language_title FROM jos_categories AS c LEFT JOIN jos_assets AS a ON a.id = c.asset_id LEFT JOIN jos_languages AS l ON l.lang_code = c.language;

I'm talking of mixed tables with utf8_general_ci and new tables utf8mb4_unicode_ci

MySql version is 5.1, i didn't try on newer versions of MySql if the issue is the same.

Do you want to replicate it? Create a table using utf8_general_ci as for Joomla 3.4 to map a one-to-one relationship with #__session for example, then JOIN it on the session_id primary key.

SELECT * FROM #__mycustomtable AS s LEFT JOIN #__session AS m ON s.session_id= m.session_id

avatar Bakual
Bakual - comment - 15 Mar 2016

@mbabker Testing on my XAMPP the query run fine.

Server-Typ: MariaDB
Server-Version: 10.1.10-MariaDB - mariadb.org binary distribution

avatar joomlajoe
joomlajoe - comment - 15 Mar 2016

@mbabker obviously your test query will fail if i change the collation of one of the JOINed
error_collation
field. See attachment.

avatar wilsonge
wilsonge - comment - 15 Mar 2016

The point is not what are extensions affected, nobody can be aware of what extensions are using a JOIN query like these ones. The point is that this is a B/C break, now Joomla creates tables as: ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 DEFAULT COLLATE=utf8mb4_general_ci

It has never created tables like this before 3.5.0.....

The general CI is for smart search only and is a new change made in 3.5.0 as well..... and the same line in 3.4.8 was like:

) ENGINE=InnoDB DEFAULT CHARSET=utf8;
avatar wilsonge
wilsonge - comment - 15 Mar 2016

And RE: your comment on

The point is not what are extensions affected, nobody can be aware of what extensions are using a JOIN query like these ones.

Actually that is EXACTLY the question.... every bug fix is a b/c break. The question is what extensions expect what, what extension are doing and what the likelihood of breaking many extension is. If I break a fraction of a number of extension but give big improvements to sites with non-latin characters (because of the sorting mechanisms of unicode vs general ci) then I don't mind if say 3 custom extensions break (obviously an exaggeration - but you understand my point)

I think very very few (if any) are going to join over session ID. And maybe a handful might join over menu type. I'm slightly more worried about the join on language type - but i have got several people to update sites today and people with >20 languages on their sites have reported no issues - so I'm still struggling to work out how frequently this issue occurs on the sites.

avatar anibalsanchez
anibalsanchez - comment - 15 Mar 2016

This is a theoric issue, crafted to find a collation conflict.... no real issue has been reported.

Unsubscribed too.

avatar joomlajoe
joomlajoe - comment - 15 Mar 2016

@wilsonge nobody can know how many sites/extensions will break. but you think different from me. I would never release something that could be a potential danger.

avatar joomlajoe
joomlajoe - comment - 15 Mar 2016

@anibalsanchez the error in the screenshot that i posted is not theoric issue, it's a real MySQL issue. Please stop to post stupid things.

avatar JoomliC
JoomliC - comment - 15 Mar 2016

As said in one of my previous message, this issue could be reproduced with the "only" one official Joomla extension com_weblinks : https://github.com/joomla/joomla-cms/releases
After install, go to Weblinks > links

I've found too that issue happens with the popular Phoca Gallery
After install, go to Phoca Gallery > Categories

...

In the same time, many extensions are not PHP7 ready.
As @beat said, the JED newsletter could be the way to inform.
I think we could go for it, but should inform developers. I know myself because i've tested all beta versions, and check my extension, and now i know what to do, but unfortunately, not all extensions developers check issue tracker, nor test their extensions on beta versions of Joomla (of course, this is not nice, but it's reality).

So if this colation should be changed in Joomla 3, so the 3.5 is the good version to process it, as already needs for developers to check about PHP7.
Just let's communicate now about this ?

avatar brianteeman
brianteeman - comment - 15 Mar 2016

PHP7 is a completely separate issue - lets stick to the one topic here
please

On 15 March 2016 at 15:54, Cyril Rezé notifications@github.com wrote:

As said in one of my previous message, this issue could be reproduced with
the "only" one official Joomla extension com_weblinks :
https://github.com/joomla/joomla-cms/releases
After install, go to Weblinks > links

I've found too that issue happens with the popular Phoca Gallery
After install, go to Phoca Gallery > Categories

...

In the same time, many extensions are not PHP7 ready.
As @beat https://github.com/beat said, the JED newsletter could be the
way to inform.
I think we could go for it, but should inform developers. I know because
i've tested all beta versions, and check my extension, and now i know what
to do, but unfortunately, not all extensions developers check issue
tracker, nor test their extensions on beta versions of Joomla (of course,
this is not nice, but it's reality).

So if this colation should be change in Joomla 3, so the 3.5 is the good
version to process it, as already needs for developers to check about PHP7.
Just let's communicate now about this ?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#9423 (comment)

Brian Teeman
Co-founder Joomla! and OpenSourceMatters Inc.
http://brian.teeman.net/

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

As said in one of my previous message, this issue could be reproduced with the "only" one official Joomla extension com_weblinks : https://github.com/joomla/joomla-cms/releases
After install, go to Weblinks > links

Installed the latest weblinks package (https://github.com/joomla-extensions/weblinks/releases/download/3.4.1/pkg_weblinks_3.4.1.zip) in a clean latest staging install and didn't got any error (the joomla tables are utf8mb4_unicode_ci and the weblinks table is utf8_general_ci).
Can you point the exact steps for the error to appear?

avatar brianteeman
brianteeman - comment - 15 Mar 2016

I've found too that issue happens with the popular Phoca Gallery
After install, go to Phoca Gallery > Categories

I just installed PG on a clean install of 3.5rc - created a few categories - no problem - what was I suppose to see?

avatar JoomliC
JoomliC - comment - 15 Mar 2016

This is not on a clean install! but updated install.
I will do a new test with an update from 3.4.8 to 3.5.0-rc3 (maybe the issue i have is there...) to be sure...
...

avatar brianteeman
brianteeman - comment - 15 Mar 2016

OK I will do the same then - install on 3.4.8 and then update

avatar mbabker
mbabker - comment - 15 Mar 2016

Joe did you review the SQL dump I posted? If so, you'd note that in that stub I purposefully created the jos_assets table with utf8_unicode_ci collation and a utf8 charset, the jos_categories table with utf8mb4_general_ci collation and a utf8mb4 charset, and the jos_languages table with utf8_general_ci collation and a utf8 charset. I very explicitly manipulated this SQL dump to produce a scenario where one query will join to multiple tables and each of those tables running different collations.

Also note this uses core's schema of not using foreign keys on any columns. If you're doing something that does create "real" foreign keys, that may affect test results.

You noted your MySQL platform is 5.1, so my dump as posted won't work for you (you'd have to change the utf8mb4 references on that platform).

I'm very explicitly creating a test scenario decoupled from the CMS and explicitly testing JUST the database layer. With that database as I scripted in https://gist.github.com/mbabker/40a31c33f277886f17c0 on my MySQL 5.6.29 platform I CANNOT replicate your issue with the mixed collations (not saying it doesn't exist). If there is a change in MySQL or a configuration setting that can affect this, we need to figure it out.

Please test with my explicit scenario I posted at #9423 (comment) (making any environmental changes for non-utf8mb4 support as needed). If you can, please provide a table schema and sample query (not a theoretical one, but something that is either an exact copy of something you are using which raises this issue or white labeled if the structure contains sensitive data).

avatar brianteeman
brianteeman - comment - 15 Mar 2016

@JoomliC Installed 3.4.8. installed phoca gallery. created some categories in phoca gallery. upgraded to 3.5rc3 - edited some categories in PG, created some categories in pg - no problems

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

For what i understand you will only have a collision if:

  • You join over a varchar (or text) primary key/index
  • The tables have different collation, but only in some cases.

I think the cases are...

Will HAVE a collision:

  • utf8mb4_unicode_ci -> utf8mb4_general_ci (or vice versa)
  • utf8_unicode_ci -> utf8_general_ci (or vice versa)

Will NOT HAVE a collision:

  • same collations
  • utf8mb4_unicode_ci -> utf8_general_ci (or vice versa)
  • utf8_unicode_ci -> utf8mb4_general_ci (or vice versa)

I tested, but please confirm.

avatar joomlajoe
joomlajoe - comment - 15 Mar 2016

Just tested com_weblinks downloaded from the JED, the issue is confirmed.
weblinks_error

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

ok you have an older mysql version! that's why you got the error and i don't

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

your tables where not converted to utf8mb4_unicode_ci, but only to utf8_unicode ci and that's a collision scenario (see my comment above)

avatar JoomliC
JoomliC - comment - 15 Mar 2016

@brianteeman do you have click on "Fix" database button ? (or do you have nothing to fix in database ?)
(will give my test feedback soon, just in 3.4.8 install for now...)

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

@JoomliC and @joomlajoe just to check, what are your mysql versions?

avatar joomlajoe
joomlajoe - comment - 15 Mar 2016

I've understood that the problem is due to the old MySql version, i used MySql 5.1 for this purpouse. But there are plenty of hosting still running MySql 5.1 and i've not still tested if MySQL 5.2+ is affected too

avatar wilsonge
wilsonge - comment - 15 Mar 2016

There's no such thing as MySQL 5.2? It jumps straight to 5.5?

avatar JoomliC
JoomliC - comment - 15 Mar 2016

@andrepereiradasilva 5.5.46 for the site with issue

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

I think 5.5.46 doesn't support utf8mb4 too.
See https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

ok, so if for old databases Joomla does not do any conversion to utf8_unicode_ci you will not have the problem right?

@richard67 can you confirm this?

avatar mbabker
mbabker - comment - 15 Mar 2016

MySQL 5.5.3 supports utf8mb4. In PHP it relies on the server version and in some cases the underlying engine. So it's actually possible for a MySQL server to support it but the PHP install to not, which is why the conversion code is somewhat complex.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

ok i see thanks @mbabker

BTW i'm no db expert just trying to help.

avatar mbabker
mbabker - comment - 15 Mar 2016

Never claimed I was either. But there are a hella lot of variables to take into account with all this. It's why I was pushing for a test scenario solely against the database instead of using extensions and the PHP code structures, that eliminates a lot of possible deviations.

avatar brianteeman
brianteeman - comment - 15 Mar 2016

This isn't working in my php3 installation
On 15 Mar 2016 4:32 pm, "Michael Babker" notifications@github.com wrote:

Never claimed I was either. But there are a hella lot of variables to take
into account with all this. It's why I was pushing for a test scenario
solely against the database instead of using extensions and the PHP code
structures, that eliminates a lot of possible deviations.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#9423 (comment)

avatar mbabker
mbabker - comment - 15 Mar 2016

Just for reference, I plugged my test scenario into a MySQL 5.5.46 box I was able to get my hands on (I also added a 4th table so it had a combination of all 4 possible collations in the query). No query errors. So it seems we've got an issue ONLY on MySQL 5.1 (and probably 5.0, anyone got one of those servers lying around?).

avatar JoomliC
JoomliC - comment - 15 Mar 2016

@brianteeman Ok, so fresh install on locahost of 3.4.8, install weblinks, and then upgrade to RC3 and no issues.
In fact, i have the issue only on my online test site, where i have installed each update one after each other, since Joomla 3.2.0 (testing versions included)... (so not a concrete scenario)

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

Has able to create a scenario that replicate the collision scenario on weblinks in a utf8mb4 capable server.

ALTER TABLE `#__languages` CHANGE `lang_code` `lang_code` CHAR(7) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL;
(#__ is your table prefix)
  • Now try to load weblinks. You will get the error described.

Again this is one of the scenarios described above:

Will HAVE a collision:

  • utf8mb4_unicode_ci -> utf8mb4_general_ci (or vice versa)
  • utf8_unicode_ci -> utf8_general_ci (or vice versa)

One index is utf8_general_ci (language code in weblinks table) other is utf8_unicode_ci (language code in language table)

avatar richard67
richard67 - comment - 15 Mar 2016

@andrepereiradasilva The conversion is also run on utf8 databases, not only on utf8mb4. It changes also sizes of columns and indexes and collations (except for smart search tables) from general_ci to unicode_ci if no utf8mb4 support.

avatar JoomliC
JoomliC - comment - 15 Mar 2016

I can confirm that i have issue only if no utf8mb4 support.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

So, i think we can confirm the problem only exists in non utf8mb4 capable server because of the joins over utf8_general_ci <-> utf8_unicode_ci varchar indexes collision scenario.

avatar brianteeman
brianteeman - comment - 15 Mar 2016

Wasnt the first mysql version that suppors utf8mb4 5.5.3 which was released
SIX years agoin 2010

On 15 March 2016 at 17:00, andrepereiradasilva notifications@github.com
wrote:

So, i think we can confirm the problem only exists in non utf8mb4 capable
server because of the joins over utf8_general_ci <-> utf8_unicode_ci
indexes collision scenario.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#9423 (comment)

Brian Teeman
Co-founder Joomla! and OpenSourceMatters Inc.
http://brian.teeman.net/

avatar mbabker
mbabker - comment - 15 Mar 2016

Is it a non-capable utf8mb4 MySQL server or a non-capable utf8mb4 PHP configuration?

There's a reason my test comments and requests have been crafted the way they are. I want to confirm the MySQL environments that issues are being run into against. Once that is exhausted, THEN if necessary also factor in PHP environments that might have an issue where the MySQL server wouldn't.

Right now there are too many variables and too many scenarios to account for. My way very purposefully limits that scope to the smallest number of variables we can and is designed to gradually work up to account for all of them. If you want to do the crapshoot of just randomly testing random extensions and throwing pseudo SQL statements out there, then have at it. But that's just going to waste time.

I'm going to do paid work now because I do not have enough concrete data to offer any solutions to this other than my sarcastic "revert all the changes" comments from yesterday. If there is real concrete data to work with, then I'm all eyes.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

@mbabker your sql is missing a USE `joomla_stub`;

avatar mbabker
mbabker - comment - 15 Mar 2016

M'eh, I honestly just grabbed the CREATE DATABASE query out of Sequel Pro and threw it at the top of the gist. Where do I need it?

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

after the create database.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

Results:
Server: 10.0.24-MariaDB - MariaDB Server
SQL database/table creation: OK
SQL SELECT Query run: OK

avatar mbabker
mbabker - comment - 15 Mar 2016

Updated. I also put the test query I'm using in the gist so that's easily referenced.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

Results:
Server: 5.5.44-MariaDB-log - MariaDB Server (CentOS 7.x)
SQL database/table creation (joomla.sql): OK
SQL SELECT Query run (testquery.sql): OK

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

Results:
Server: 5.1.73 - Source distribution (CentOS 6.x)
SQL database/table creation (joomla.sql): ERROR

CREATE TABLE `jos_categories` ( `id` int(11) NOT NULL AUTO_INCREMENT, `asset_id` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'FK to the #__assets table.', `parent_id` int(10) unsigned NOT NULL DEFAULT '0', `lft` int(11) NOT NULL DEFAULT '0', `rgt` int(11) NOT NULL DEFAULT '0', `level` int(10) unsigned NOT NULL DEFAULT '0', `path` varchar(255) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '', `extension` varchar(50) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '', `title` varchar(255) COLLATE utf8mb4_general_ci NOT NULL, `alias` varchar(400) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL DEFAULT '', `note` varchar(255) COLLATE utf8mb4_general_ci NOT NULL DEFAULT '', `description` longtext COLLATE utf8mb4_general_ci NOT NULL, `published` tinyint(1) NOT NULL DEFAULT '0', `checked_out` int(11) unsigned NOT NULL DEFAULT '0', `checked_out_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', `access` int(10) unsigned NOT NULL DEFAULT '0[...]

#1273 - Unknown collation: 'utf8mb4_general_ci'

SQL SELECT Query run (testquery.sql): ERROR

#1146 - Table 'joomla_stub.jos_categories' doesn't exist
avatar mbabker
mbabker - comment - 15 Mar 2016

Expected since utf8mb4 doesn't exist on MySQL 5.1. Manually change the mb4 reference to plain utf8 (you'll still have mixed collations when you run things, but now it'll be 2 tables with utf8_general_ci and 2 with utf8_unicode_ci instead of all 4 tables with different collations)

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

Results: (after replace all utf8mb4 text for utf8 in joomla.sql file)
Server: 5.1.73 - Source distribution (CentOS 6.x)
SQL database/table creation (joomla.sql): OK
SQL SELECT Query run (testquery.sql): OK

avatar mbabker
mbabker - comment - 15 Mar 2016

So what that tells me is the mixed collations thing ISN'T as big of an issue as it's being made out to be, or this test scenario sucks for reproducing it.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

The thing is: 3.5.0 change the tables collation even for non utf8mb4 capable DB (or trough PHP...) servers.

It change them to utf8_unicode_ci in all joomla tables (except the finder_* tables), and if they have installed extensions with utf8_general_ci collation (the default for utf8 charset i think) that use joins with varchars (menutype, lang_code, session_id, etc) there will be a collision.

This is one of the collision scenario described: utf8_unicode_ci <--> utf8_general_ci

It's not for all users, neither all extensions.

In conclusion, i think the collisions will happen on users:

  • with non utf8mb4 capable DB (or trough PHP...) servers ... (ex: CentOS 6.x)
  • that also use extensions that do joins over varchar index, being lang_code and menutype probably the most common. (ex: weblinks)
avatar mbabker
mbabker - comment - 15 Mar 2016

I very purposefully included the join on the lang_code in the scenario to explicitly test that case though. I think on the 5.1 test run you'll want to change that from general to unicode though as the categories table uses the general collation (so taking utf8mb4 out of the mix categories and languages are general and assets and users are unicode, the problem join seems to be on languages).

avatar mbabker
mbabker - comment - 15 Mar 2016

It was the test scenario. I changed the languages and users tables collations/charsets around and now on my local 5.6.29 I get the Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_general_ci,IMPLICIT) for operation '=' error.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

Will HAVE a collision:

  • utf8mb4_unicode_ci -> utf8mb4_general_ci (or vice versa)
  • utf8_unicode_ci -> utf8_general_ci (or vice versa)
avatar richard67
richard67 - comment - 15 Mar 2016

What keeps us from having the lang_code column in pure ASCII in future? Maybe such can be joined with any utf8mb4 or utf8 collation without conflict? I have no idea, was just a thought.

avatar mbabker
mbabker - comment - 15 Mar 2016

Truthfully, why can't the language column be a proper foreign key as every other pseudo foreign key in the schema (pseudo because Joomla doesn't actually use or enforce them)?

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

@mbabker
you're test scenario went ok because:

SELECT c.*, a.name AS asset_name, l.title AS language_title, u.name AS user_name
FROM jos_categories AS c
LEFT JOIN jos_assets AS a ON a.id = c.asset_id (int index utf8_general_ci <--> utf8_unicode_ci)
LEFT JOIN jos_languages AS l ON l.lang_code = c.language (varchar index utf8_general_ci <--> utf8_general_ci)
LEFT JOIN jos_users AS u ON u.id = c.created_user_id; (int index utf8_general_ci <--> utf8_general_ci)

You have none of the collision scenarios (remember only for varchar indexes):

Will HAVE a collision:

  • utf8mb4_unicode_ci -> utf8mb4_general_ci (or vice versa)
  • utf8_unicode_ci -> utf8_general_ci (or vice versa)

for having a collision either jos_categories or jos_languages and their indexes fields have to be utf8_unicode_ci

avatar mbabker
mbabker - comment - 15 Mar 2016

Ya, I figured that out finally. I just randomly changed collations on tables, wasn't paying attention specifically to what table was getting what collation (and inherently making sure a join would raise an error). Now it's in this shape:

SELECT c.*, a.name AS asset_name, l.title AS language_title, u.name AS user_name
FROM jos_categories AS c
LEFT JOIN jos_assets AS a ON a.id = c.asset_id # (int index utf8mb4_general_ci <--> utf8_unicode_ci)
LEFT JOIN jos_languages AS l ON l.lang_code = c.language # (varchar index utf8mb4_general_ci <--> utf8mb4_unicode_ci)
LEFT JOIN jos_users AS u ON u.id = c.created_user_id; # (int index utf8mb4_general_ci <--> utf8_general_ci)

So...

avatar richard67
richard67 - comment - 15 Mar 2016

Guys, maybe for the lang_code column I have a solutiion: I here changed the collation of this particular column in the languages table to ascii_general_ci and left it in the other tables (categories and so on) in @andrepereiradasilva 's join query as it was (uft8mb4_unicode_ci here), and I go no conflict:

SELECT c.*, a.name AS asset_name, l.title AS language_title, u.name AS user_name FROM j3ux0_categories AS c LEFT JOIN j3ux0_assets AS a ON a.id = c.asset_id LEFT JOIN j3ux0_languages AS l ON l.lang_code = c.language LEFT JOIN j3ux0_users AS u ON u.id = c.created_user_id ;

Ans for language codes we know they will be only ASCII!

Can this be the solution?

Can you test that with your test scenario, @mbabker ?

Or can you test that, @andrepereiradasilva ?

avatar beat
beat - comment - 15 Mar 2016

@mbabker Following your call for tests:

Here my database install:

Server type: MySQL
Server version: 5.5.47-0ubuntu0.14.04.1 - (Ubuntu)
Protocol version: 10
Server charset: UTF-8 Unicode (utf8)

Running the query on your gist https://gist.github.com/mbabker/40a31c33f277886f17c0 I get following error:

#1267 - Illegal mix of collations (utf8mb4_unicode_ci,IMPLICIT) and (utf8mb4_general_ci,IMPLICIT) for operation '='

And running your query above in your comment #9423 (comment) I get same error.

avatar richard67
richard67 - comment - 15 Mar 2016

The reason why I think this could work is that implicit conversion not fails maybe when converting from a charset which is a subset of another one, like ASCII is a subset ot UTF8 or UTF8MB4, and UTF8 is a subset of UTF8MB4, so implicit conversion from ASCII to UTF8(MB4) might work while vice versa it will not.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

@mbabker do you have the new joomla.sql file to test?

avatar mbabker
mbabker - comment - 15 Mar 2016

I haven't changed the gist, what does it need to be?

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

since in your joomla.sql file jos_categories is utf8mb4_general_ci just change jos_languages to utf8mb4_unicode_ci and you will have collisions.

avatar mbabker
mbabker - comment - 15 Mar 2016

I made that change already.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

i'm lost! what is your gist URI again?

avatar richard67
richard67 - comment - 15 Mar 2016

Guys, again, can someone try how it is with column lang_code of languages table having collation ascii_general_ci? If there are any conflicts then? I think implicit conversion of this column then to the charset of any joined column might work because ascii_general_ci is a subset of all other charsets and for ascii there is no different sorting, no unicode_ci.

Maybe that would be the solution for the languages joned by the lang_code, which as I read here is the most likely problem case.

avatar beat
beat - comment - 15 Mar 2016

@richard67 @mbabker We need to be careful with joins on indexed text columns collations: They must be same. Different collations in JOIN statements will disable the use of the indexes and ruin the performance for large tables and websites. Having different collations on joins is usually not a good idea. And it's not only me saying this, but an MySQL expert (since proven experts are in demand here): http://www.xaprb.com/blog/2012/08/20/why-wont-mysql-use-the-best-index-in-a-join/

Thus my recommendation for optimal performance has always been is to keep same collation and charsets accross the board.

avatar mbabker
mbabker - comment - 15 Mar 2016

https://gist.github.com/mbabker/40a31c33f277886f17c0

The revisions tab shows each change as I've made it. The last one I did changed the collation/charset on the languages and users tables and that started giving the mixed collations error for me.

avatar mbabker
mbabker - comment - 15 Mar 2016

@richard67 With ascii_general_ci I get a successful query.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

@mbabker with your new query, as suspect: Illegal mix of collations in all servers.

avatar richard67
richard67 - comment - 15 Mar 2016

@mbabker So it seems to work with ascii and implicit conversion.

@beat noted what it could have a performance impact but this is a theoretical thing in my eyes, because the number of languages will stay small and so I see no problems with having not optimal index use.

So I see this as a possible solution now.

Another solution of course (but breaking B/C) would be to use integer id as key and for joins (as provided by iso language number or so, have seen it somewhere) instead of the language code.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

@richard67 so, if it works, Is this possible to we use this ascii_general_ci in varchar columns only on non capable utf8mb4 servers until they got a utf8mb4 server?

Also, will have no side effects? Also is this ascii_general_ci collation supported by this servers?

Isn't easy to have no general to unicode conversion in non capable utf8mb4 servers?

avatar richard67
richard67 - comment - 15 Mar 2016

@andrepereiradasilva ascii_general_ci is supported on all mysql servers as far as i can see because is an old thing, not "new" like utf8mb4. It is the very basic charset and collation for USA for example.

It would be enough to do it for this one column only, all other language code columns e.g. in category we might be able to leave them as they are because joins most likely are done to the language table, and then the ascii is converted implicitely to the charset of the joned column (which is a superset of ascii and so implicit conversion can take place without possibility of data loss).

It would not be good for any other varchar columns except we can be sure they will be ASCII only, like it is with the language code according to iso standards.

And it also would not be good to have the column on newer utf8mb4 databases in utf8mb4 collation because there still might be other extensions in utf8 joining to it.

It realls is a special case for short varchars where we know for 100% sure they can only be ascii.

At the moment I see this only for language codes, another kind of column where this could be made were country codes, but we do not have such.

avatar richard67
richard67 - comment - 15 Mar 2016

Side effect will be that implicit conversion costs preformance and so should not be done for joins on large tables. But the languages table is small, we do not have 10 000 records, and all other tables would keep their language code column (if they have such) in their preferred collation (utf8_general_ci, utf8_unicode_ci, utf8mb4_general_ci or utf8mb4_unicode_ci).

avatar richard67
richard67 - comment - 15 Mar 2016

But as I said, is only a solution for the lang_code column of the lanuages table. If it turns out that this solution will be accepted by thew folks, I can make the PR for that.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

them i will leave to the people that decide.

avatar mbabker
mbabker - comment - 15 Mar 2016

I say if it works, use it. But I gave up decision rights in favor of glorified loud mouth status ages ago :laughing:

avatar richard67
richard67 - comment - 15 Mar 2016

Me too, I do not wanna decide that alone. Only wanted to find a possible solution for this special but most likely case of collisions.

@mbabker @wilsonge Let me know what you think about it.

avatar richard67
richard67 - comment - 15 Mar 2016

Ah Michael was faster than me, he already answered before I asked.

avatar JoomliC
JoomliC - comment - 15 Mar 2016

@richard67 i confirm that with ascii_general_ci on lang_code it works.

avatar richard67
richard67 - comment - 15 Mar 2016

Hmm, I just think maybe we really should change ALL lang_code columns wherever we have them? Or will it be enough just to change lang_code column of the languages table? What do you think, guys?

avatar mbabker
mbabker - comment - 15 Mar 2016

I only changed the column in the languages table. So I'd go for least disruptive while still getting the job done.

avatar JoomliC
JoomliC - comment - 15 Mar 2016

@richard67 changing only in languages table fixed issue for all extensions i've found with this issue, so should be enough ;-)
(stupid question: is this possible only if no utf8mb4 support ?)

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

i think utf8_bin works too and it's utf8.

avatar richard67
richard67 - comment - 15 Mar 2016

@andrepereiradasilva if this works, too, it would be even the better solution ... and just as you say it, it should have been a binary sorted column from the beginnign on, because according to ISO standard it CAN only be ascii and so sorts binary as well as general_ci or unicode_ci but much faster.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

and it also exists and utf8mb4_bin for utf8mb4 capable servers.

avatar richard67
richard67 - comment - 15 Mar 2016

But if we use utf8_bin and not ascii_bin (which also would be good), we should leave it on utf8mb4 capable systems also in utf8_bin, i.e. not use utf8mb4_bin there, so not converted extensions still can join to it.

avatar richard67
richard67 - comment - 15 Mar 2016

Right?

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

Will HAVE a collision:

  • utf8mb4_unicode_ci -> utf8mb4_general_ci (or vice versa)
  • utf8_unicode_ci -> utf8_general_ci (or vice versa)

Will NOT HAVE a collision:

  • same collations
  • utf8mb4_unicode_ci -> utf8_general_ci (or vice versa)
  • utf8_unicode_ci -> utf8mb4_general_ci (or vice versa)

so i think there will be no collision on utf8_bin <--> utf8mb4_bin
but of course tests have to be made

avatar richard67
richard67 - comment - 15 Mar 2016

I tested the test query from Michael's gist on a 3.5.0 RC3 with utf8mb4 support, i.e. joined columns are utf8mb4_unicode_ci, and the lang_code column of languages table I had changed before to utf8_bin, and it worked.

avatar richard67
richard67 - comment - 15 Mar 2016

So it seems we have the choice, ascii_bin or utf8_bi, both seems to work. ASCII would be possible because ISO standards say this for language codes, and maybe would be faster than utf8?

avatar richard67
richard67 - comment - 15 Mar 2016

Who votes for utf8_bin and who votes for ascii_bin?

avatar richard67
richard67 - comment - 15 Mar 2016

I will prepare a PR then.

avatar beat
beat - comment - 15 Mar 2016

@richard67 My feeling is that utf8_bin will work better than ascii_bin, as there is no conversion of charset and thus it might well use indexes (didn't test yet, but adding EXPLAIN before the test-query will tell you that.

avatar mbabker
mbabker - comment - 15 Mar 2016

Doesn't make a difference. Using ascii_bin, utf8_bin, or utf8mb4_bin collation on the languages table lang_code column, my test case always gives this for the EXPLAIN SELECT... query:

screen shot 2016-03-15 at 3 03 37 pm

avatar richard67
richard67 - comment - 15 Mar 2016

Yes .. and I just have seen that eve following combination works, the lang_code column of the languages table in utf8mb4_bin, and the language column of the categories table in utf8_general_ci.

So in principle we can make the column lang_code of the languages table utf8mb4_bin on utf8mb4 capable systems, and for not utf8mb4 capable it will be changed to utf8_bin by the query downgrade, and all is fine for the languages stuff, which seems to be the most likely use case for the joins.

Can you confirm too that this works, guys?

avatar beat
beat - comment - 15 Mar 2016

@richard67 @mbabker I can confirm my feeling :smile: :

  1. Using @mbabker 's gist, and just changing in jos_languages the column lang_code to either utf8_bin or to ascii_bin work.
  2. But with utf8_bin it uses the index idx_langcode for the l query:

screenshot from 2016-03-15 20 06 23

  1. And with ascii_bin it does not:

screenshot from 2016-03-15 20 10 15

  1. But with utf8mb4_bin it does use the index key:

screenshot from 2016-03-15 20 11 54

So my vote goes against asii_bin clearly.

My prefered solution is utf8mb4_bin as it is exactly same charset as rest of Joomla 3.5, and thus is the most consistent one.

Now my vote probably doesn't count more than anyone else's, so utf8mb4_bin is just my recommendation, and i'm ok with utf8_bin too but not with asii_bin. :smile:

avatar richard67
richard67 - comment - 15 Mar 2016

@beat Thanks for the detailled information. I just prepare the PR, will be ready soon. It may take a whil until I have good testing instructions for the others, but those involved here will be able to pre-test even without instructions and so give a first impression of we have solved it.

avatar wilsonge
wilsonge - comment - 15 Mar 2016

And we are happy that language code is the only field likely to in large scale use yes?

avatar beat
beat - comment - 15 Mar 2016

@wilsonge Btw, I believe that we can use utf8mb4_bin on any text column that doesn't need sorting or searching with case-insensitivity (those need *_ci as case-insensitive collations). So e.g. in the session tables it is ok too.

I received a report that @nikosdion seems to have issues in his own Akeeba extensions as well as with FOF (question for @nikosdion : is FOF that is included in Joomla 3.5 also having issues ?) with certain columns of Joomla, so he could give a list of columns that conflict with his extensions too. Those could be good candidates for the utf8mb4_bin change.

avatar andrepereiradasilva
andrepereiradasilva - comment - 15 Mar 2016

ok here is a lot of test cases: https://gist.github.com/andrepereiradasilva/f330a9646d3d298457dd
Two files:

  • one creates the environment for the tests.
  • the second has a lot of SELECT tests and my results in front on the query.

For non capable utf8mb4 servers, all utf8mb4 related tables/data/queries must be removed.

So for the tests i run utf8_bin for varchar indexes for non utf8mb4 capable servers and utf8mb4_bin for utf8mb4 capable servers works, i.e., no collisions.

avatar richard67
richard67 - comment - 15 Mar 2016

PR is #9440 , please test. If we later see we need this change for other columns, too, and I have no time then maybe (cannot be really foreseen), then someone else can see in my PR how it has to be done.

avatar richard67
richard67 - comment - 15 Mar 2016

Maybe I should change other language code columns, eg languagein table #__categories, too? What do you think, guys?

avatar wilsonge
wilsonge - comment - 15 Mar 2016

i think we only pay attention to core library tables. categories is an extension. we don't guarentee b/c there and why would people be joining over that table?

avatar mbabker
mbabker - comment - 15 Mar 2016

Joining on the ID would be fine. If you're joining from the menus table to the categories table on the language columns, you deserve the one way ticket to the padded room that I've purchased you.

avatar richard67
richard67 - comment - 15 Mar 2016

Ok, Michael, I agree. Would be silly to join like this.

So testers are welcome here.

avatar richard67
richard67 - comment - 15 Mar 2016

PR #9440

avatar nikosdion
nikosdion - comment - 15 Mar 2016

@beat That's innacurate. I don't know why you think I'm having issues with my extensions. I'm not, at all. I am actually adding features to make the life of people using Joomla! 3.5 easier. The only extension that had an issue was CMS Update and that's the same issue as the one fixed in Joomla! Update in RC3.

Sorry for having to mention explicitly my company but you mention it by name I have to hijack your thread to explain what I've been doing today.

First, let me assure anyone reading this that Akeeba Backup is already compatible with Joomla! 3.5, utf8mb4 and PHP 7, actually since June 2015 (I'd always been testing against staging). In fact, utf8mb4 database support was added to the backup engine even earlier when WordPress introduced that feature. Do remember that my backup software runs in WordPress too, among other environments ;)

So, basically what happened today is that I noticed Joomla! 3.5.0 RC2 had made two changes regarding updates (location of the XML manifest and the way target platform versions are encoded in the update XML source). I had to update CMS Update since it wouldn't see the 3.5 releases and even if it did it wouldn't be able to install them. At that point (12 hours ago) my objective was to have CMS Update update 3.4.8 to 3.5 RC3.

Fixing CMS Update required making a change in FOF 2.x (which I copied to FOF 3.x too). The XML update source parser CMS Update uses is part of FOF. If I have to release a new FOF version I thought it would make sense to add UTF8MB4 support to FOF's schema updater, therefore completing Joomla! 3.5 support in the framework. So I did that and it took me most of the day as I wanted to add automatic utf8mb4 upgrade of all tables created through FOF's schema manager (and, conversely, downgrade utf8mb4 DDL to utf8 on servers which don't support utf8mb4 or Joomla! < 3.5).

However, adding that support without converting the schema of my existing extensions didn't make sense. Basically, I had to make sure that any indexes against VARCHAR columns were resized to avoid conversion errors and –bonus!– make querying faster. So I went through all of my extensions and, unsurprisingly, only 8 indices across 7 components and three dozen tables needed conversion. Of course going through each table and index manually takes a few hours :p

That was quite the death march due to unforeseen snowballing of the work I was doing. Maybe that's what you mistook for "having issues with my software"? I could have just released a quickly patched Akeeba CMS Update this morning but I felt it would make more sense going full UTF8MB4 across the board while I am at it.

Long story cut short, by the end of the day I will have spent 15 hours today so that anyone running Akeeba Backup on Joomla! 3.5 and MySQL 5.5.6+ can enter Emoji in the backup description and comment. LOL! I could of course not do that and let people use Akeeba Backup 4.5.5 on Joomla! 3.5, with or without utf8mb4, which is already tested and working because –unlike most Joomla! developers– I am testing against staging.

To be honest, my biggest drive behind releasing a new version of Akeeba Backup is a feature I added last week allowing you to downgrade utf8mb4 data to utf8 when you are moving to an older server that does not support utf8mb4. That's an additional feature I've added to help users move sites between servers, something which will be outright impossible to do manually: it's not enough to replace utf8mb4_ with utf8_, you also need to clobber multibyte characters (good luck with that if you DO have multibyte characters in your data...).

However, in the light if this thread I am not going to release anything unless there's a new RC or stable with the language column collation change. Frankly, I think my software is not affected by you know what they say about assumptions and donkeys :D The worst thing that can happen with the current versions of my software and utf8mb4 is exactly what already happens in 3.4.8: if you type Emoji in any text field any data after and including the Emoji will be discarded by MySQL without an error. Since Akeeba Backup, which is the most important extension in my portfolio, already works fine with Joomla! 3.5 RC3 with or without utf8mb4 I am not worried all that much. I am just annoyed that after all the planning and work I put today I can still NOT release anything until yet another last minute change is made. If there's an RC4 it means that my death march today was unnecessary and I will still have to work a full day (instead of taking half day off) tomorrow. Sigh...

Regarding FOF, I don't have the time to update it in the core with the tight deadline you have. Work for making a living takes priority. More so when the extra, unscheduled work is the direct result of Joomla! making last minute changes after what was supposed to be the last RC ;)

Sorry for the long post, here's a potato (soon you'll be able to type that in Joomla! content as well!) ????

avatar beat
beat - comment - 15 Mar 2016

@nikosdion Sorry for the wrong impression I got from your Facebook message and the hashtag used there. Happy to hear that all your extensions work fine and that your work today on all your extensions was in fact not urgent in relation with Joomla 3.5. Btw, this is not "my" thread: I didn't start it and my extensions work fine with Joomla master since months too. I just tried to help here find a solution for com_weblinks and other possible B/C issues.

avatar wilsonge
wilsonge - comment - 15 Mar 2016

I'm going to close this in favour of the PR. I think we are happy there are no more fields to update beyond the one we have talked about

avatar wilsonge wilsonge - change - 15 Mar 2016
Status New Closed
Closed_Date 0000-00-00 00:00:00 2016-03-15 21:43:27
Closed_By wilsonge
avatar wilsonge wilsonge - close - 15 Mar 2016
avatar brianteeman brianteeman - close - 15 Mar 2016
avatar brianteeman brianteeman - change - 22 Apr 2016
Labels Added: ?
avatar mjl778374
mjl778374 - comment - 28 Aug 2019

Hi. If you have phpMyAdmin installed, you can follow the instructions given in the following link: https://mediatemple.net/community/products/dv/204403914/default-mysql-character-set-and-collation You have to match the collate of the database with that of all the tables, as well as the fields of the tables and then recompile all the stored procedures and functions. With that everything should work again.

Add a Comment

Login with GitHub to post a comment