? ? ? Success

User tests: Successful: Unsuccessful:

avatar nikosdion
nikosdion
14 Jun 2015

Pinging @wilsonge and @Bakual about this.

Executive summary

Joomla! uses the UTF-8 character set to acquire, process and store information in the database. Due to the way MySQL 5.5.2 and earlier versions handle UTF-8 any Unicode character requiring 4 bytes for storage leads to text gone missing. This is both annoying (loss of information) and potentially dangerous (see Unicode security).

This PR addresses the issue in two ways:

  • On MySQL 5.5.3 and later we are using the backwards compatible utf8mb4 encoding which resolves this issue
  • On MySQL 5.5.2 and earlier we filter out 4-byte characters in JInputFilter itself, replacing them with the Unicode ambiguity symbol, preventing loss of text.

Backwards compatibility

This PR is fully backwards compatible. Data is migrated automatically on upgrade. Existing extensions (core and third party) don't have to change anything in their code, everything is handled at the database lever.

However, if you have written your own custom database driver you MUST implement utf8mb4 support. If you don't there is no actual problem: your code will STILL work fine. The 4-byte Unicode characters will be stripped off. This is NOT a backwards compatibility break. Before this PR a 4-byte Unicode character would lead to loss of data. After the PR the 4-byte character is replaced with the Unicode "uncertainty" character and there is no loss of data after it. Therefore even if you don't bother upgrading your database driver this PR will still guard you against data loss and potential security implications. You're welcome.

Translation impact

None (not applicable)

Test instructions

Reproducing the issue

  • Go to Content, Article Manager, Add New Article

  • Enter the following title:
    Title ???? and it's gone!

  • Enter the following content:
    This part of the text works. ???? The devil lies in the details.

IMPORTANT Make sure you enter the text above using an actual Emoji, NOT a picture reference, i.e. do NOT copy-paste the text from GitHub

  • Click on Save

  • WHOOPS! The title now only reads Title and the body text only reads This part of the text works.. That's because MySQL eats up the 4-byte Unicode characters using the utf8 encoding :(

Installation

  • Go to System, System Information
  • There should be no line reading Database Connection Collation

  • Apply patch

  • Go to Extensions, Extension Manager, Database and click on Fix
  • The fix should complete successfully

Alternatively just install a new site using the ZIP file of the PR's branch

Verify the connection

  • Go to System, System Information
  • The fourth line in the System Information tab should now read "Database Connection Collation: utf8mb4_general_ci"

If it reads "utf8_general_ci" please confirm that the Database Version reported is less than 5.5.3. If this is the case please skip to "Testing on MySQL 5.5.2 or earlier (including 5.1)"

Testing under PHP 5.5.3 and later

  • Repeat the instructions under "Reproducing the issue"
  • The title and content now display in full (including the Chinese character and the Emoji). Success!!

If you get a question mark instead of the chinese symbol and the Emoji you have forgotten to go through the Extensions, Extension Manager, Database, Fix routine. Still, even though you screwed up this PR made sure you don't get truncated text therefore mitigating the issue.

If, however, you got either a blank rectangle or a question mark inside a hollow diamond this means that either your PHP version is very old or the PHP mysql, mysqli, mysqlnd or PDO driver is linked against a very old MySQL client library. In this case please consult "Testing on MySQL 5.5.2 or earlier (including 5.1)". That's a limitation of PHP but STILL you get proper mitigation of the issue.

Testing on MySQL 5.5.2 or earlier (including 5.1)

  • Repeat the instructions under "Reproducing the issue"
  • The Chinese character and the Emoji appear as either an empty square or a question mark in a diamond. However, the text following them did NOT disappear.

Notes and technical information

We try to guess if MySQL supports utf8mb4 by first looking at the driver version. If the PHP driver (mysql, mysqli, mysqlnd, PDO) doesn't support utf8mb4 we drop back to utf8. If the driver supports utf8mb4 we try to connect using the utf8mb4 client encoding. If that fails we retry connecting with the utf8 encoding and mark the driver as NOT supporting utf8mb4. This is normal and expected when you are using MySQL server 5.5.2 or earlier (commonly: 5.1.x) with a relatively new PHP version.

JInputFilter now has a $stripUSC parameter which tells it whether it should strip 4-byte Unicode characters. If this is not set, or set to -1, JInputFilter will get a reference to the database object through JFactory and ask it if it supports utf8mb4. If it doesn't the $stripUSC flag is set to 1 (strip). In this case all 4-byte Unicode characters are replaced with the 3-byte Unicode "uncertainty" character which renders either as a small blank rectangle or as a question mark inside a hollow diamond. This prevents the MySQL issue of losing all text after such a character and its potential security implications.

Upon upgrade all core tables are upgraded to utf8mb4 by virtue of the 3.5.0-2015-01-01.sql script. DO NOT CHANGE THE NAME OF THIS SCRIPT!!! The name HAD to be hard-coded in com_installer's Database, Fix action to allow this feature to handle upgrade to utf8mb4. Unfortuantely there is no simple way to either list core tables (without having a hardcoded list) or tell the default character set of a table. Therefore the only way to ensure the database is up-to-date –especially when transferring a site from a MySQL 5.1 server to a MySQL 5.5.3+ server– is always running this conversion script at the end of the database fix.

All CREATE TABLE and ALTER TABLE queries MUST be written using the utf8mb4 character set and encodings. If Joomla! detects that the database does not support utf8mb4 it will automatically downgrade them to utf8. This simplifies installation.

Should you need to upgrade a table from PPH code you can use JFactory::getDbo()->alterTableCharacterSet('#__your_table').

Finally, the JDatabaseDriver now has two methods getName and getServerType to get the name of the driver (works like $db->name) and the server family type. The latter is used to determine whether we're running under MySQL.

Special considerations for third party developers

This section is meant for inclusion in our documentation

The following applies only on MySQL databases. Other database servers are not affected.

utf8 vs utf8mb4

Joomla! up to 3.4 uses the "utf8" character set for its database. This means that MySQL will use 1 to 3 bytes to store each Unicode character. The problem is that certain 4-byte Unicode characters such as Emoji, extended Chinese and others –collectively called Unicode Supplementary Characters– cannot be expressed under this scheme. Instead of simply dropping the offending character, MySQL will drop not just this character but all the text that follows it. This is both annoying (data loss) and a potential security issue (depending on how data is rendered by your component).

Joomla! 3.5.0 onwards uses the backwards-compatible utf8mb4 encoding when available, namely when Joomla! is installed on a server using MySQL 5.5.3 or later and an up-to-date version of the mysql, mysqli or mysqlng PHP extension. On top of that, Joomla! will filter out 4-byte characters when they cannot be stored in the database, therefore preventing this data loss and potential security issue.

From your PHP code's point of view you do not have to make any change whatsoever in your PHP code to support it. Joomla! does all the hard work for you! You SHOULD, however, make changes to your extensions' schema installation (.sql) files to support storing 4-byte long Unicode Supplementary Characters.

Writing your schema files to support extended characters

Your schema files MUST ALWAYS use the utf8mb4 character set when creating or changing tables (CREATE TABLE and ALTER TABLE queries). Joomla! will figure out at extension installation / update time whether the MySQL server supports utf8mb4 or not. If it doesn't, Joomla! will automatically convert your query to use the old utf8 character set and the respective utf8_* collations instead of the utf8mb4 character set and utf8mb4_* collations in your .sql file.

Example:

CREATE TABLE IF NOT EXISTS `#__example_items` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_greek_ci NOT NULL DEFAULT '',
  `alias` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL DEFAULT '',
  `mytext` mediumtext NOT NULL,
  `state` tinyint(3) NOT NULL DEFAULT 0,
  `catid` int(10) unsigned NOT NULL DEFAULT 0,
  `created` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `created_by` int(10) unsigned NOT NULL DEFAULT 0,
  `created_by_alias` varchar(255) NOT NULL DEFAULT '',
  `modified` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `modified_by` int(10) unsigned NOT NULL DEFAULT 0,
  `checked_out` int(10) unsigned NOT NULL DEFAULT 0,
  `checked_out_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `publish_up` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `publish_down` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `ordering` int(11) NOT NULL DEFAULT 0,
  `access` int(10) unsigned NOT NULL DEFAULT 0,
  `language` char(7) NOT NULL COMMENT 'The language code for the article.',
  PRIMARY KEY (`id`),
  KEY `idx_access` (`access`),
  KEY `idx_checkout` (`checked_out`),
  KEY `idx_state` (`state`),
  KEY `idx_catid` (`catid`),
  KEY `idx_createdby` (`created_by`),
  KEY `idx_language` (`language`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 DEFAULT COLLATE=utf8mb4_general_ci;

As you can see, everything is set up to use utf8mb4. If a user tries to install your extension on a server using MySQL 5.1 Joomla! will automatically rewrite the query above as:

CREATE TABLE IF NOT EXISTS `#__example_items` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(255) CHARACTER SET utf8 COLLATE utf8_greek_ci NOT NULL DEFAULT '',
  `alias` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '',
  `mytext` mediumtext NOT NULL,
  `state` tinyint(3) NOT NULL DEFAULT 0,
  `catid` int(10) unsigned NOT NULL DEFAULT 0,
  `created` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `created_by` int(10) unsigned NOT NULL DEFAULT 0,
  `created_by_alias` varchar(255) NOT NULL DEFAULT '',
  `modified` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `modified_by` int(10) unsigned NOT NULL DEFAULT 0,
  `checked_out` int(10) unsigned NOT NULL DEFAULT 0,
  `checked_out_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `publish_up` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `publish_down` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `ordering` int(11) NOT NULL DEFAULT 0,
  `access` int(10) unsigned NOT NULL DEFAULT 0,
  `language` char(7) NOT NULL COMMENT 'The language code for the article.',
  PRIMARY KEY (`id`),
  KEY `idx_access` (`access`),
  KEY `idx_checkout` (`checked_out`),
  KEY `idx_state` (`state`),
  KEY `idx_catid` (`catid`),
  KEY `idx_createdby` (`created_by`),
  KEY `idx_language` (`language`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8 DEFAULT COLLATE=utf8_general_ci;

You DO NOT and MUST NOT write a query using utf8 manually. Let Joomla! do it for you!

If you do not use utf8mb4 in your tables, Joomla! is running on a MySQL server supporting utf8mb4 and the user enters a Unicode Supplementary Character no real harm is done (since Joomla! 3.5.0): the character is automatically replaced by MySQL with a question mark (?).

Updating your existing tables

When a user updates their site to Joomla! 3.5.0 only the core database tables will be converted to utf8mb4. The update will not convert third party extensions' tables. Likewise, the "mysql" schema installation scripts are not automatically converted to utf8mb4. This is by design and has to do with MySQL's key length.

If you are using InnoDB tables the maximum size for an index / key is 767 bytes. When using the old "utf8" character set each character of a CHAR or VARCHAR column counts as 3 bytes, whereas under the utf8mb4 character set each character counts as 4 bytes. This means that your old indices will probably be too big.

A practical example can be found in the #__menu core table. Before Joomla! 3.5 there was this index:

UNIQUE KEY `idx_client_id_parent_id_alias_language` (`client_id`,`parent_id`,`alias`,`language`),

Client ID is a tinyint (1 byte), parent_id an int (4 bytes), language a 7-character CHAR column (21 bytes under utf8, 28 under utf8mb4) and alias is a 255 character VARCHAR (765 bytes under utf8, 1020 bytes under utf8mb4). As you can understand, the index becomes too long for MySQL to handle under utf8mb4.

There is a workaround to that. We can tell MySQL to use the first n bytes of a column for the index. In our case, we can tell MySQL to use the first 191 characters of the alias column, therefore reserving 191 characters x 4 bytes per character = 764 bytes which is less than the 767 bytes limit of InnoDB. You can also see why trying to use 192 characters would fail: 192 characters x 4 bytes per character = 768 bytes which is one byte too long for MySQL's index limit.

In order to fix the index we need to drop and create it again in a single query:

ALTER TABLE `#__MENU` DROP KEY `IDX_CLIENT_ID_PARENT_ID_ALIAS_LANGUAGE`, ADD UNIQUE KEY `idx_client_id_parent_id_alias_language` (`client_id`,`parent_id`,`alias`(191),`language`);

Please keep in mind that the DROP KEY and ADD KEY should be on the same query, separated by a comma.

After having resized the index we need to convert any columns which do not use the same collation as the rest of the table. In #__menu's case there's the alias column which uses the utf8_bin collation. We have to convert it to the utf8mb4_bin collation instead. It's a simple ALTER TABLE command:

ALTER TABLE `#__menu` MODIFY `alias` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'The SEF alias of the menu item.';

Finally, we will have to convert the table itself to utf8mb4 encoding.

IMPORTANT: Do NOT try to convert the table before resizing the indices, it will lead to an error!

The query to convert a table to utf8mb4 is as straightforward as it gets:

ALTER TABLE `#__menu` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

So, the complete SQL script to convert the #__menu table is:

ALTER TABLE `#__MENU` DROP KEY `IDX_CLIENT_ID_PARENT_ID_ALIAS_LANGUAGE`, ADD UNIQUE KEY `idx_client_id_parent_id_alias_language` (`client_id`,`parent_id`,`alias`(191),`language`);
ALTER TABLE `#__menu` MODIFY `alias` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL COMMENT 'The SEF alias of the menu item.';
ALTER TABLE `#__menu` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

Now you might think, but what about my users who are NOT using utf8mb4? As we already saw, Joomla! is smart enough to figure this out for you. It automatically converts the queries referencing the utf8mb4 character set and collations to utf8. So you do not need to worry about it, Joomla! will sort everything out for you as long as you use utf8mb4 in your .sql files.

If you have written a custom MySQL database driver

You SHOULD make small changes to your database driver unless you are extending from JDatabaseDriverMysqli or JDatabaseDriverMysql AND you have not overridden the setUTF method. These changes are not necessary, but highly recommended.

First you need to do:

$this->serverType = 'mysql';
$this->utf8mb4 = true;

in your __construct method.

Secondly you need to copy the setUTF method from JDatabaseDriverMysqli or JDatabaseDriverMysql. This method determines if utf8mb4 is supported by the database server. If you fail to do that you might end up with broken text.

avatar nikosdion nikosdion - open - 14 Jun 2015
avatar brianteeman
brianteeman - comment - 14 Jun 2015

New language string seemed odd but after reading the mysql docs I understand it and it makes sense so good to go for the new string

avatar zero-24 zero-24 - change - 14 Jun 2015
Category MS SQL Postgresql SQL
avatar zero-24 zero-24 - change - 14 Jun 2015
Status New Pending
avatar zero-24 zero-24 - change - 14 Jun 2015
Labels Added: ? ?
Build 3.5-dev staging
avatar zero-24 zero-24 - change - 14 Jun 2015
Labels Added: ?
avatar zero-24 zero-24 - change - 14 Jun 2015
Labels Added: ?
avatar zero-24
zero-24 - comment - 14 Jun 2015

This PR produces for Travis a PHP Error in the unit tests?

PHP Fatal error:  Class JDatabaseDriverNosql contains 1 abstract method and must therefore be declared abstract or implement the remaining methods (JDatabaseDriver::getConnectionCollation) in /home/travis/build/joomla/joomla-cms/tests/unit/suites/libraries/joomla/database/stubs/nosqldriver.php on line 18

BTW here is a Travis CS Issues fix PR :smile: nikosdion#9

Thanks :smile:

avatar nikosdion nikosdion - change - 14 Jun 2015
Build 3.5-dev staging
avatar nikosdion nikosdion - reference | 0317281 - 14 Jun 15
avatar nikosdion
nikosdion - comment - 14 Jun 2015

The latest commit should fix that issue.

avatar wilsonge
wilsonge - comment - 14 Jun 2015

More unit tests to amend yet :(

avatar nikosdion nikosdion - change - 15 Jun 2015
Build 3.5-dev staging
avatar nikosdion
nikosdion - comment - 15 Jun 2015

@wilsonge I fixed the tests and I also made a small change. I changed the new JDatabaseDriver::getConnectionCollation method from abstract to concrete (returning an empty string) to prevent backwards compatibility issues with existing custom database drivers created by advanced site integrators and third party developers.

avatar infograf768
infograf768 - comment - 15 Jun 2015

I did a sql info here and my version is 5.5.25.
In PHPMyadmin I have the choice for utf8mb4 collations

avatar nikosdion
nikosdion - comment - 15 Jun 2015

@infograf768 Yes, you do (5.5.25 is 22 versions higher than the minimum requirement of MySQL 5.5.3 for this feature), but Joomla! didn't support the utf8mb4 connection collation. This means that even if you manually converted your tables you wouldn't be able to use 4-byte Unicode characters. This is one of the things fixed by this PR, per the (very detailed) description I wrote :)

avatar infograf768
infograf768 - comment - 16 Jun 2015

I meant that your check for sql version is >= 5.5.3
return version_compare($client_version, '5.5.3', '>=');

while it seems the feature is available before (5.5.25).
Evidently J! has to be modified too.

is there something I don't understand?

avatar nikosdion
nikosdion - comment - 16 Jun 2015

No, Jean-Marie. Please consult this page to understand how version numbers work. As you can see 5.5.3 is followed by 5.5.4, 5.5.5, ..., 5.5.9, 5.5.10, 5.5.11, ..., 5.5.19, 5.5.20, 5.5.21, ..., 5.5.24, 5.5.25.

To give you some perspective: if you have a site running Joomla! 2.5.3 are you twenty five releases behind the latest version (2.5.28) or not? Think about it. You already know that, why did you get stuck on this now?

avatar infograf768
infograf768 - comment - 16 Jun 2015

Ooops.... Evidently. Me even older than I thought. Saw a period when none...

avatar nikosdion
nikosdion - comment - 16 Jun 2015

:)

avatar nikosdion nikosdion - change - 16 Jun 2015
Build 3.5-dev staging
avatar nikosdion
nikosdion - comment - 16 Jun 2015

OK, the last notice from phpcs is fixed.

avatar nikosdion
nikosdion - comment - 17 Jun 2015

Now that the Unit Tests are all green could some people please test it on their servers? I've tried it with PHP 5.4, 5.5 and 5.6 and MySQL 5.1 and 5.5 on Mac OS X and Linux. Any other environment will be highly appreciated.

avatar brianteeman
brianteeman - comment - 18 Jun 2015

After following the instructions and using Unicode Supplementary Characters Test Data from http://www.i18nguy.com/unicode/supplementary-test.html#utf8 I can confirm that the issue is resolved.

Tested with php 5.5.23 and 5.3.29 and mysql 5.5.42 on OSX


This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/7173.

avatar brianteeman brianteeman - test_item - 18 Jun 2015 - Tested successfully
avatar anibalsanchez
anibalsanchez - comment - 20 Jun 2015

Test Ok, now we are talking ... an article with a SMILING FACE on title and alias!!

Tested on php 5.5.9 and MySQL 5.5.43 on Linux Mint 17.1


This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/7173.

avatar anibalsanchez anibalsanchez - test_item - 20 Jun 2015 - Tested successfully
avatar nikosdion
nikosdion - comment - 20 Jun 2015

@wilsonge Since we have two successful tests can we please set it RTC for the 3.5 branch?

avatar brianteeman brianteeman - change - 20 Jun 2015
Status Pending Ready to Commit
avatar brianteeman
brianteeman - comment - 20 Jun 2015

Setting RTC


This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/7173.

avatar brianteeman brianteeman - change - 20 Jun 2015
Labels Added: ?
avatar wilsonge wilsonge - change - 20 Jun 2015
Milestone Added:
avatar wilsonge wilsonge - change - 20 Jun 2015
Milestone Added:
avatar wilsonge wilsonge - change - 20 Jun 2015
Milestone Removed:
avatar Kubik-Rubik Kubik-Rubik - close - 9 Jul 2015
avatar Kubik-Rubik Kubik-Rubik - change - 9 Jul 2015
Status Ready to Commit Closed
Closed_Date 0000-00-00 00:00:00 2015-07-09 20:45:50
Closed_By Kubik-Rubik
avatar Kubik-Rubik Kubik-Rubik - close - 9 Jul 2015
avatar mbabker mbabker - change - 9 Jul 2015
Status Closed New
Closed_Date 2015-07-09 20:45:50
Closed_By Kubik-Rubik
avatar mbabker mbabker - reopen - 9 Jul 2015
avatar mbabker mbabker - reopen - 9 Jul 2015
avatar wilsonge wilsonge - reference | 04f09d5 - 10 Jul 15
avatar wilsonge
wilsonge - comment - 10 Jul 2015

Merged with 04f09d5

avatar wilsonge wilsonge - change - 10 Jul 2015
Status New Closed
Closed_Date 0000-00-00 00:00:00 2015-07-10 07:08:15
Closed_By wilsonge
avatar wilsonge wilsonge - close - 10 Jul 2015
avatar wilsonge wilsonge - close - 10 Jul 2015
avatar roland-d roland-d - reference | aa86314 - 6 Aug 15
avatar zero-24 zero-24 - change - 14 Oct 2015
Labels Removed: ?
avatar joomla-cms-bot joomla-cms-bot - change - 14 Oct 2015
Labels Added: ?
avatar zero-24
zero-24 - comment - 4 Nov 2015

This here produces issues expecial the last commit here. Please notice #8267 and #8268

avatar brianteeman
brianteeman - comment - 4 Nov 2015

There are VERY important reasons for this PR do not blindly revert

avatar zero-24
zero-24 - comment - 4 Nov 2015

@brianteeman see: #8268 I'm not reverting I'm fixing but there is still one open point ;)

avatar wilsonge
wilsonge - comment - 26 Nov 2015

@nikosdion The changing of the alias to 191 chars is causing all sort of issues around the place.

After @zero-24 lost his PR due to a mistake rebasing it we merged in manually everything except his alias changes and @roland-d has redone that in #8472 however there have been all sorts of issues raised in that PR. I've been rereading your description of by the 191 character change was made but what I'm not clear on is that does this mean we are going to loose existing characters in alias' in sites etc? If you could reply on #8472 it would be hugely appreciated :)

avatar nikosdion
nikosdion - comment - 26 Nov 2015

Byte count. 255 quad octets are larger than what MySQL allows plain and simple. If you increase that limit stuff breaks. If you don't bc breaks. We can't change how MySQL works. I explained everything in my PR months ago.

Add a Comment

Login with GitHub to post a comment