User tests: Successful: Unsuccessful:
Pinging @wilsonge and @Bakual about this.
Joomla! uses the UTF-8 character set to acquire, process and store information in the database. Due to the way MySQL 5.5.2 and earlier versions handle UTF-8 any Unicode character requiring 4 bytes for storage leads to text gone missing. This is both annoying (loss of information) and potentially dangerous (see Unicode security).
This PR addresses the issue in two ways:
This PR is fully backwards compatible. Data is migrated automatically on upgrade. Existing extensions (core and third party) don't have to change anything in their code, everything is handled at the database lever.
However, if you have written your own custom database driver you MUST implement utf8mb4 support. If you don't there is no actual problem: your code will STILL work fine. The 4-byte Unicode characters will be stripped off. This is NOT a backwards compatibility break. Before this PR a 4-byte Unicode character would lead to loss of data. After the PR the 4-byte character is replaced with the Unicode "uncertainty" character and there is no loss of data after it. Therefore even if you don't bother upgrading your database driver this PR will still guard you against data loss and potential security implications. You're welcome.
None (not applicable)
Go to Content, Article Manager, Add New Article
Enter the following title:
Title ???? and it's gone!
Enter the following content:
This part of the text works. ???? The devil lies in the details.
IMPORTANT Make sure you enter the text above using an actual Emoji, NOT a picture reference, i.e. do NOT copy-paste the text from GitHub
Click on Save
WHOOPS! The title now only reads Title
and the body text only reads This part of the text works.
. That's because MySQL eats up the 4-byte Unicode characters using the utf8 encoding :(
There should be no line reading Database Connection Collation
Apply patch
Alternatively just install a new site using the ZIP file of the PR's branch
If it reads "utf8_general_ci" please confirm that the Database Version reported is less than 5.5.3. If this is the case please skip to "Testing on MySQL 5.5.2 or earlier (including 5.1)"
If you get a question mark instead of the chinese symbol and the Emoji you have forgotten to go through the Extensions, Extension Manager, Database, Fix routine. Still, even though you screwed up this PR made sure you don't get truncated text therefore mitigating the issue.
If, however, you got either a blank rectangle or a question mark inside a hollow diamond this means that either your PHP version is very old or the PHP mysql, mysqli, mysqlnd or PDO driver is linked against a very old MySQL client library. In this case please consult "Testing on MySQL 5.5.2 or earlier (including 5.1)". That's a limitation of PHP but STILL you get proper mitigation of the issue.
We try to guess if MySQL supports utf8mb4 by first looking at the driver version. If the PHP driver (mysql, mysqli, mysqlnd, PDO) doesn't support utf8mb4 we drop back to utf8. If the driver supports utf8mb4 we try to connect using the utf8mb4 client encoding. If that fails we retry connecting with the utf8 encoding and mark the driver as NOT supporting utf8mb4. This is normal and expected when you are using MySQL server 5.5.2 or earlier (commonly: 5.1.x) with a relatively new PHP version.
JInputFilter now has a $stripUSC
parameter which tells it whether it should strip 4-byte Unicode characters. If this is not set, or set to -1, JInputFilter will get a reference to the database object through JFactory and ask it if it supports utf8mb4. If it doesn't the $stripUSC
flag is set to 1 (strip). In this case all 4-byte Unicode characters are replaced with the 3-byte Unicode "uncertainty" character which renders either as a small blank rectangle or as a question mark inside a hollow diamond. This prevents the MySQL issue of losing all text after such a character and its potential security implications.
Upon upgrade all core tables are upgraded to utf8mb4 by virtue of the 3.5.0-2015-01-01.sql script. DO NOT CHANGE THE NAME OF THIS SCRIPT!!! The name HAD to be hard-coded in com_installer's Database, Fix action to allow this feature to handle upgrade to utf8mb4. Unfortuantely there is no simple way to either list core tables (without having a hardcoded list) or tell the default character set of a table. Therefore the only way to ensure the database is up-to-date –especially when transferring a site from a MySQL 5.1 server to a MySQL 5.5.3+ server– is always running this conversion script at the end of the database fix.
All CREATE TABLE and ALTER TABLE queries MUST be written using the utf8mb4 character set and encodings. If Joomla! detects that the database does not support utf8mb4 it will automatically downgrade them to utf8. This simplifies installation.
Should you need to upgrade a table from PPH code you can use JFactory::getDbo()->alterTableCharacterSet('#__your_table')
.
Finally, the JDatabaseDriver now has two methods getName and getServerType to get the name of the driver (works like $db->name) and the server family type. The latter is used to determine whether we're running under MySQL.
This section is meant for inclusion in our documentation
The following applies only on MySQL databases. Other database servers are not affected.
Joomla! up to 3.4 uses the "utf8" character set for its database. This means that MySQL will use 1 to 3 bytes to store each Unicode character. The problem is that certain 4-byte Unicode characters such as Emoji, extended Chinese and others –collectively called Unicode Supplementary Characters– cannot be expressed under this scheme. Instead of simply dropping the offending character, MySQL will drop not just this character but all the text that follows it. This is both annoying (data loss) and a potential security issue (depending on how data is rendered by your component).
Joomla! 3.5.0 onwards uses the backwards-compatible utf8mb4 encoding when available, namely when Joomla! is installed on a server using MySQL 5.5.3 or later and an up-to-date version of the mysql, mysqli or mysqlng PHP extension. On top of that, Joomla! will filter out 4-byte characters when they cannot be stored in the database, therefore preventing this data loss and potential security issue.
From your PHP code's point of view you do not have to make any change whatsoever in your PHP code to support it. Joomla! does all the hard work for you! You SHOULD, however, make changes to your extensions' schema installation (.sql) files to support storing 4-byte long Unicode Supplementary Characters.
Your schema files MUST ALWAYS use the utf8mb4 character set when creating or changing tables (CREATE TABLE and ALTER TABLE queries). Joomla! will figure out at extension installation / update time whether the MySQL server supports utf8mb4 or not. If it doesn't, Joomla! will automatically convert your query to use the old utf8 character set and the respective utf8_* collations instead of the utf8mb4 character set and utf8mb4_* collations in your .sql file.
Example:
CREATE TABLE IF NOT EXISTS `#__example_items` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_greek_ci NOT NULL DEFAULT '',
`alias` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL DEFAULT '',
`mytext` mediumtext NOT NULL,
`state` tinyint(3) NOT NULL DEFAULT 0,
`catid` int(10) unsigned NOT NULL DEFAULT 0,
`created` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`created_by` int(10) unsigned NOT NULL DEFAULT 0,
`created_by_alias` varchar(255) NOT NULL DEFAULT '',
`modified` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`modified_by` int(10) unsigned NOT NULL DEFAULT 0,
`checked_out` int(10) unsigned NOT NULL DEFAULT 0,
`checked_out_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`publish_up` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`publish_down` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`ordering` int(11) NOT NULL DEFAULT 0,
`access` int(10) unsigned NOT NULL DEFAULT 0,
`language` char(7) NOT NULL COMMENT 'The language code for the article.',
PRIMARY KEY (`id`),
KEY `idx_access` (`access`),
KEY `idx_checkout` (`checked_out`),
KEY `idx_state` (`state`),
KEY `idx_catid` (`catid`),
KEY `idx_createdby` (`created_by`),
KEY `idx_language` (`language`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 DEFAULT COLLATE=utf8mb4_general_ci;
As you can see, everything is set up to use utf8mb4. If a user tries to install your extension on a server using MySQL 5.1 Joomla! will automatically rewrite the query above as:
CREATE TABLE IF NOT EXISTS `#__example_items` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) CHARACTER SET utf8 COLLATE utf8_greek_ci NOT NULL DEFAULT '',
`alias` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '',
`mytext` mediumtext NOT NULL,
`state` tinyint(3) NOT NULL DEFAULT 0,
`catid` int(10) unsigned NOT NULL DEFAULT 0,
`created` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`created_by` int(10) unsigned NOT NULL DEFAULT 0,
`created_by_alias` varchar(255) NOT NULL DEFAULT '',
`modified` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`modified_by` int(10) unsigned NOT NULL DEFAULT 0,
`checked_out` int(10) unsigned NOT NULL DEFAULT 0,
`checked_out_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`publish_up` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`publish_down` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`ordering` int(11) NOT NULL DEFAULT 0,
`access` int(10) unsigned NOT NULL DEFAULT 0,
`language` char(7) NOT NULL COMMENT 'The language code for the article.',
PRIMARY KEY (`id`),
KEY `idx_access` (`access`),
KEY `idx_checkout` (`checked_out`),
KEY `idx_state` (`state`),
KEY `idx_catid` (`catid`),
KEY `idx_createdby` (`created_by`),
KEY `idx_language` (`language`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8 DEFAULT COLLATE=utf8_general_ci;
You DO NOT and MUST NOT write a query using utf8 manually. Let Joomla! do it for you!
If you do not use utf8mb4 in your tables, Joomla! is running on a MySQL server supporting utf8mb4 and the user enters a Unicode Supplementary Character no real harm is done (since Joomla! 3.5.0): the character is automatically replaced by MySQL with a question mark (?).
When a user updates their site to Joomla! 3.5.0 only the core database tables will be converted to utf8mb4. The update will not convert third party extensions' tables. Likewise, the "mysql" schema installation scripts are not automatically converted to utf8mb4. This is by design and has to do with MySQL's key length.
If you are using InnoDB tables the maximum size for an index / key is 767 bytes. When using the old "utf8" character set each character of a CHAR or VARCHAR column counts as 3 bytes, whereas under the utf8mb4 character set each character counts as 4 bytes. This means that your old indices will probably be too big.
A practical example can be found in the #__menu
core table. Before Joomla! 3.5 there was this index:
UNIQUE KEY `idx_client_id_parent_id_alias_language` (`client_id`,`parent_id`,`alias`,`language`),
Client ID is a tinyint (1 byte), parent_id an int (4 bytes), language a 7-character CHAR column (21 bytes under utf8, 28 under utf8mb4) and alias is a 255 character VARCHAR (765 bytes under utf8, 1020 bytes under utf8mb4). As you can understand, the index becomes too long for MySQL to handle under utf8mb4.
There is a workaround to that. We can tell MySQL to use the first n bytes of a column for the index. In our case, we can tell MySQL to use the first 191 characters of the alias column, therefore reserving 191 characters x 4 bytes per character = 764 bytes which is less than the 767 bytes limit of InnoDB. You can also see why trying to use 192 characters would fail: 192 characters x 4 bytes per character = 768 bytes which is one byte too long for MySQL's index limit.
In order to fix the index we need to drop and create it again in a single query:
ALTER TABLE `#__MENU` DROP KEY `IDX_CLIENT_ID_PARENT_ID_ALIAS_LANGUAGE`, ADD UNIQUE KEY `idx_client_id_parent_id_alias_language` (`client_id`,`parent_id`,`alias`(191),`language`);
Please keep in mind that the DROP KEY and ADD KEY should be on the same query, separated by a comma.
After having resized the index we need to convert any columns which do not use the same collation as the rest of the table. In #__menu
's case there's the alias
column which uses the utf8_bin collation. We have to convert it to the utf8mb4_bin collation instead. It's a simple ALTER TABLE command:
ALTER TABLE `#__menu` MODIFY `alias` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'The SEF alias of the menu item.';
Finally, we will have to convert the table itself to utf8mb4 encoding.
IMPORTANT: Do NOT try to convert the table before resizing the indices, it will lead to an error!
The query to convert a table to utf8mb4 is as straightforward as it gets:
ALTER TABLE `#__menu` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
So, the complete SQL script to convert the #__menu
table is:
ALTER TABLE `#__MENU` DROP KEY `IDX_CLIENT_ID_PARENT_ID_ALIAS_LANGUAGE`, ADD UNIQUE KEY `idx_client_id_parent_id_alias_language` (`client_id`,`parent_id`,`alias`(191),`language`);
ALTER TABLE `#__menu` MODIFY `alias` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'The SEF alias of the menu item.';
ALTER TABLE `#__menu` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Now you might think, but what about my users who are NOT using utf8mb4? As we already saw, Joomla! is smart enough to figure this out for you. It automatically converts the queries referencing the utf8mb4 character set and collations to utf8. So you do not need to worry about it, Joomla! will sort everything out for you as long as you use utf8mb4 in your .sql files.
You SHOULD make small changes to your database driver unless you are extending from JDatabaseDriverMysqli
or JDatabaseDriverMysql
AND you have not overridden the setUTF
method. These changes are not necessary, but highly recommended.
First you need to do:
$this->serverType = 'mysql';
$this->utf8mb4 = true;
in your __construct
method.
Secondly you need to copy the setUTF
method from JDatabaseDriverMysqli
or JDatabaseDriverMysql
. This method determines if utf8mb4 is supported by the database server. If you fail to do that you might end up with broken text.
Category | ⇒ | MS SQL Postgresql SQL |
Status | New | ⇒ | Pending |
Labels |
Added:
?
?
|
||
Build | 3.5-dev | ⇒ | staging |
Labels |
Added:
?
|
Labels |
Added:
?
|
This PR produces for Travis a PHP Error in the unit tests?
PHP Fatal error: Class JDatabaseDriverNosql contains 1 abstract method and must therefore be declared abstract or implement the remaining methods (JDatabaseDriver::getConnectionCollation) in /home/travis/build/joomla/joomla-cms/tests/unit/suites/libraries/joomla/database/stubs/nosqldriver.php on line 18
BTW here is a Travis CS Issues fix PR nikosdion#9
Thanks
Build | 3.5-dev | ⇒ | staging |
The latest commit should fix that issue.
More unit tests to amend yet :(
Build | 3.5-dev | ⇒ | staging |
@wilsonge I fixed the tests and I also made a small change. I changed the new JDatabaseDriver::getConnectionCollation method from abstract to concrete (returning an empty string) to prevent backwards compatibility issues with existing custom database drivers created by advanced site integrators and third party developers.
I did a sql info here and my version is 5.5.25.
In PHPMyadmin I have the choice for utf8mb4 collations
@infograf768 Yes, you do (5.5.25 is 22 versions higher than the minimum requirement of MySQL 5.5.3 for this feature), but Joomla! didn't support the utf8mb4 connection collation. This means that even if you manually converted your tables you wouldn't be able to use 4-byte Unicode characters. This is one of the things fixed by this PR, per the (very detailed) description I wrote :)
I meant that your check for sql version is >= 5.5.3
return version_compare($client_version, '5.5.3', '>=');
while it seems the feature is available before (5.5.25).
Evidently J! has to be modified too.
is there something I don't understand?
No, Jean-Marie. Please consult this page to understand how version numbers work. As you can see 5.5.3 is followed by 5.5.4, 5.5.5, ..., 5.5.9, 5.5.10, 5.5.11, ..., 5.5.19, 5.5.20, 5.5.21, ..., 5.5.24, 5.5.25.
To give you some perspective: if you have a site running Joomla! 2.5.3 are you twenty five releases behind the latest version (2.5.28) or not? Think about it. You already know that, why did you get stuck on this now?
Ooops.... Evidently. Me even older than I thought. Saw a period when none...
:)
Build | 3.5-dev | ⇒ | staging |
OK, the last notice from phpcs is fixed.
Now that the Unit Tests are all green could some people please test it on their servers? I've tried it with PHP 5.4, 5.5 and 5.6 and MySQL 5.1 and 5.5 on Mac OS X and Linux. Any other environment will be highly appreciated.
After following the instructions and using Unicode Supplementary Characters Test Data from http://www.i18nguy.com/unicode/supplementary-test.html#utf8 I can confirm that the issue is resolved.
Tested with php 5.5.23 and 5.3.29 and mysql 5.5.42 on OSX
Test Ok, now we are talking ... an article with a SMILING FACE on title and alias!!
Tested on php 5.5.9 and MySQL 5.5.43 on Linux Mint 17.1
Status | Pending | ⇒ | Ready to Commit |
Setting RTC
Labels |
Added:
?
|
Milestone |
Added: |
Milestone |
Added: |
Milestone |
Removed: |
Status | Ready to Commit | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2015-07-09 20:45:50 |
Closed_By | ⇒ | Kubik-Rubik |
Status | Closed | ⇒ | New |
Closed_Date | 2015-07-09 20:45:50 | ⇒ | |
Closed_By | Kubik-Rubik | ⇒ |
Status | New | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2015-07-10 07:08:15 |
Closed_By | ⇒ | wilsonge |
Labels |
Removed:
?
|
Labels |
Added:
?
|
There are VERY important reasons for this PR do not blindly revert
@brianteeman see: #8268 I'm not reverting I'm fixing but there is still one open point ;)
@nikosdion The changing of the alias to 191 chars is causing all sort of issues around the place.
After @zero-24 lost his PR due to a mistake rebasing it we merged in manually everything except his alias changes and @roland-d has redone that in #8472 however there have been all sorts of issues raised in that PR. I've been rereading your description of by the 191 character change was made but what I'm not clear on is that does this mean we are going to loose existing characters in alias' in sites etc? If you could reply on #8472 it would be hugely appreciated :)
Byte count. 255 quad octets are larger than what MySQL allows plain and simple. If you increase that limit stuff breaks. If you don't bc breaks. We can't change how MySQL works. I explained everything in my PR months ago.
New language string seemed odd but after reading the mysql docs I understand it and it makes sense so good to go for the new string