Joomla! Issue Tracker | Joomla! CMS #24498 - Chinese strings are not correctly truncated if $noSplit is set to true.

No Code Attached Yet J4 Issue bug

Confirmed
Medium
Build: staging
# 24498

newjie
5 Apr 2019

Steps to reproduce the issue

   Use JHtmlString::truncate to truncate a Chinese string with the $noSplit parameter set to true.

Expected result

   The string should be truncated correctly in accordance with the $length parameter.

Actual result

   The function returns "..."

System information (as much as possible)

   Joomla 3.9.4

Additional comments

I use the truncate function to get the first 140 characters of articles. If $noSplit is set to false, there is no problem. But in some cases $noSplit has to be true so that English text is truncated correctly as well.
In `\libraries\cms\html\string.php`, lines 74 to 83, the comment itself states the problem:

// Find the position of the last space within the allowed length.
// If there are no spaces and the string is longer than the maximum
// we need to just use the ellipsis. In that case we are done.

Because Chinese text generally contains no spaces, line 75

$offset = StringHelper::strrpos($tmp, ' ');

returns false, so the truncate function returns "..." at line 83.
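
As a rough illustration of the reported behaviour, here is a simplified stand-in for the $noSplit branch. It is a sketch, not the actual Joomla source: `truncateNoSplitSketch` is a hypothetical name, the `mb_*` functions stand in for `StringHelper`, and the mbstring extension is assumed to be available.

```php
<?php
// Simplified sketch of the $noSplit logic described above; NOT the Joomla code.
// truncateNoSplitSketch is a hypothetical name; requires the mbstring extension.
function truncateNoSplitSketch(string $text, int $length): string
{
    // Short enough already: nothing to truncate.
    if (mb_strlen($text, 'UTF-8') <= $length) {
        return $text;
    }

    // Cut to the allowed length, then look for the last space in that slice.
    $tmp    = mb_substr($text, 0, $length, 'UTF-8');
    $offset = mb_strrpos($tmp, ' ', 0, 'UTF-8');

    // No space found: give up and return only the ellipsis.
    if ($offset === false) {
        return '...';
    }

    // Otherwise cut back to the last complete word.
    return rtrim(mb_substr($tmp, 0, $offset, 'UTF-8')) . '...';
}

echo truncateNoSplitSketch('The quick brown fox jumps', 12), "\n"; // The quick...
echo truncateNoSplitSketch('敏捷的棕色狐狸跳过了懒狗', 6), "\n";      // ...
```

The English input is cut at the last space, while the Chinese input, which has no space in its first six characters, collapses to the bare ellipsis.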

newjie - open - 5 Apr 2019

joomla-cms-bot - change - 5 Apr 2019

Labels

Added: ?

joomla-cms-bot - labeled - 5 Apr 2019

newjie - change - 5 Apr 2019

The description was changed

newjie - edited - 5 Apr 2019

franz-wohlkoenig - change - 5 Apr 2019

Category

⇒

Code style Language & Strings

infograf768 - comment - 5 Apr 2019

Chinese may also use the Multibyte space 「　」
We could add this possibility to the code, but it would not solve the issue when there is no space at all.

The only way would be to detect the ranges of Chinese multibyte characters in the string concerned. But this is very complex, as there are many possible blocks.
The ranges should be something like
[0x3007,0x3007],[0x3400,0x4DBF],[0x4E00,0x9FEF],[0x20000,0x2EBFF]

( https://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode )
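
For comparison, one way to avoid maintaining such ranges by hand might be PCRE's Unicode script property `\p{Han}`. This is a different technique from the range list above and only a sketch: `containsHanSketch` is a hypothetical helper, not part of Joomla, and it detects Han-script characters only.

```php
<?php
// Hypothetical helper: true if the text contains at least one Han-script character.
function containsHanSketch(string $text): bool
{
    // \p{Han} matches by Unicode script; the /u modifier makes PCRE
    // treat both the pattern and the subject as UTF-8.
    return preg_match('/\p{Han}/u', $text) === 1;
}

var_dump(containsHanSketch('这是中文'));      // bool(true)
var_dump(containsHanSketch('English only')); // bool(false)
```

Even with such a check, the code would still have to decide where to cut a string that has no spaces, which is the point made above.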

franz-wohlkoenig - change - 5 Apr 2019

Status

New

⇒

Discussion

franz-wohlkoenig - change - 5 Apr 2019

Labels

Added: ?

franz-wohlkoenig - labeled - 5 Apr 2019

infograf768 - change - 5 Apr 2019

Labels

Added: J3 Issue
Removed: ?

infograf768 - unlabeled - 5 Apr 2019

infograf768 - labeled - 5 Apr 2019

newjie - comment - 5 Apr 2019

The multibyte space is rarely used in Chinese text, and even when it is used, it does not separate words.
How about this change:

if ($offset) $tmp = StringHelper::substr($tmp, 0, $offset + 1);

// If there are no spaces and the string is longer than the maximum
// we need to just use the ellipsis. In that case we are done.
// if ($offset === false && strlen($text) > $length)
// {
// 	return '...';
// }

Imho this way we don't have to deal with Chinese character ranges. The only cost is getting part of the first word when people truncate to a length shorter than that word, which I think is very rare.
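
A self-contained sketch in the spirit of this proposal, under stated assumptions: `truncateFallbackSketch` is a hypothetical name, the `mb_*` functions stand in for `StringHelper`, and the cut is made at the space itself rather than at `$offset + 1`, which gives the same result once trailing whitespace is trimmed.

```php
<?php
// Sketch of the proposed fallback; NOT the Joomla code.
// truncateFallbackSketch is a hypothetical name; requires the mbstring extension.
function truncateFallbackSketch(string $text, int $length): string
{
    // Short enough already: nothing to truncate.
    if (mb_strlen($text, 'UTF-8') <= $length) {
        return $text;
    }

    $tmp    = mb_substr($text, 0, $length, 'UTF-8');
    $offset = mb_strrpos($tmp, ' ', 0, 'UTF-8');

    // Trim back to the last space only when one exists;
    // otherwise keep the hard cut instead of returning a bare ellipsis.
    if ($offset !== false && $offset > 0) {
        $tmp = mb_substr($tmp, 0, $offset, 'UTF-8');
    }

    return rtrim($tmp) . '...';
}

echo truncateFallbackSketch('The quick brown fox jumps', 12), "\n"; // The quick...
echo truncateFallbackSketch('敏捷的棕色狐狸跳过了懒狗', 6), "\n";      // 敏捷的棕色狐...
echo truncateFallbackSketch('Supercalifragilistic word', 5), "\n"; // Super...
```

The last call shows the trade-off mentioned above: a single word longer than the limit is cut in the middle.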

franz-wohlkoenig - change - 7 Apr 2019

Labels

Added: ?

franz-wohlkoenig - labeled - 7 Apr 2019

franz-wohlkoenig - change - 9 Apr 2019

Category

Language & Strings

⇒

obuisard - comment - 11 Mar 2020

I do think the problem comes from the PHP strlen function. For languages like Chinese or Japanese, mb_strlen should be used instead. The problem arises because strlen counts bytes, not characters.
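
A minimal illustration of the difference, assuming the source file is saved as UTF-8 and the mbstring extension is loaded:

```php
<?php
// Four Chinese characters, each encoded as three bytes in UTF-8.
$text = '中文字符';

echo strlen($text), "\n";             // 12 (bytes)
echo mb_strlen($text, 'UTF-8'), "\n"; // 4 (characters)
```

Any comparison of strlen($text) against a character limit such as $length can therefore fire too early for multibyte text.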

infograf768 - comment - 12 Mar 2020

I guess we should first check whether the function mb_strlen is available, but isn't it already provided in J by the utf8 library as

function utf8_strlen($str){
    return mb_strlen($str);
}

Can you test changing all occurrences of strlen to utf8_strlen in public static function truncate?

obuisard - comment - 12 Mar 2020

I would actually do mb_strlen($str, 'utf8').
I am going to try and test this. Not that easy lol.
This should be applied to truncate and to truncateComplex.

Actually, looking closely at the code, truncate correctly uses StringHelper::strlen() except when testing $noSplit, where a test uses strlen instead of StringHelper::strlen(). That may be the problem for truncate.

truncateComplex never uses StringHelper::strlen().

It seems like truncate and truncateComplex should be code reviewed, because I can see other places where those functions use the regular PHP functions (like strlen, strpos...) rather than the ones from StringHelper (which are guaranteed to be UTF-8 aware).

obuisard - comment - 12 Mar 2020

After doing some testing: even though UTF-8 aware functions should be used in the places where they are currently missing, doing so does not fix the problem with spaces mentioned by @newjie. I don't have a solution here :-(