No Code Attached Yet J4 Issue bug
avatar newjie
newjie
5 Apr 2019

Steps to reproduce the issue

   Use JHtmlString::truncate to truncate a Chinese string, with the $noSplit parameter set to true

Expected result

   The string should be truncated correctly in accordance with the $length parameter

Actual result

   The function returns "..."

System information (as much as possible)

   Joomla 3.9.4

Additional comments

I use the truncate function to get the first 140 characters of articles. If $noSplit is set to false, there is no problem. But in some cases $noSplit has to be true to make sure English texts are truncated correctly as well.
In `\libraries\cms\html\string.php`, lines 74 to 83, the comment itself states the problem:

// Find the position of the last space within the allowed length.
// If there are no spaces and the string is longer than the maximum
// we need to just use the ellipsis. In that case we are done.

Because there are generally no spaces in Chinese text, line 75

$offset = StringHelper::strrpos($tmp, ' ');

will return false, and the truncate function then returns "..." at line 83.
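
The failure described above can be reproduced outside Joomla with a minimal sketch. It assumes PHP's mb_* functions as stand-ins for StringHelper, and truncateNoSplitSketch is a hypothetical name that only mirrors the no-split branch; it is not the actual code in string.php.

```php
<?php
// Hypothetical stand-in for the no-split branch of truncate().
// Assumption: mb_* functions replace Joomla's StringHelper for illustration.
function truncateNoSplitSketch(string $text, int $length): string
{
    // Short enough already: nothing to truncate.
    if (mb_strlen($text, 'UTF-8') <= $length) {
        return $text;
    }

    $tmp = mb_substr($text, 0, $length, 'UTF-8');

    // Position of the last space within the allowed length, or false if none.
    $offset = mb_strrpos($tmp, ' ', 0, 'UTF-8');

    if ($offset === false) {
        // No space found, so only the ellipsis is returned.
        return '...';
    }

    return rtrim(mb_substr($tmp, 0, $offset, 'UTF-8')) . '...';
}

echo truncateNoSplitSketch('The quick brown fox jumps', 12), "\n";
echo truncateNoSplitSketch('这是一个没有空格的中文句子用来测试截断', 10), "\n";
```

The English sentence is cut at the last space, while the Chinese sentence contains no space and collapses to the ellipsis alone.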

avatar newjie newjie - open - 5 Apr 2019
avatar joomla-cms-bot joomla-cms-bot - change - 5 Apr 2019
Labels Added: ?
avatar joomla-cms-bot joomla-cms-bot - labeled - 5 Apr 2019
avatar newjie newjie - change - 5 Apr 2019
The description was changed
avatar newjie newjie - edited - 5 Apr 2019
avatar franz-wohlkoenig franz-wohlkoenig - change - 5 Apr 2019
Category Code style Language & Strings
avatar infograf768
infograf768 - comment - 5 Apr 2019

Chinese may also use the multibyte space 「 」.
We could add this possibility in the code, but it would not solve the issue when there is no space at all.

The only way would be to detect the range of Chinese multibyte characters in general in the string concerned. But this is very complex, as there are many possible blocks.
The range should be something like
[0x3007,0x3007],[0x3400,0x4DBF],[0x4E00,0x9FEF],[0x20000,0x2EBFF]

( https://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode )

See also:
https://zh.wikipedia.org/wiki/%E4%B8%AD%E6%97%A5%E9%9F%93%E7%9B%B8%E5%AE%B9%E8%A1%A8%E6%84%8F%E6%96%87%E5%AD%97
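
For illustration, the ranges quoted above could be checked with a Unicode-aware regular expression. This is only a sketch: containsCjkIdeographSketch is a hypothetical helper, the ranges are copied from the comment as-is and may not be complete, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns true if $text contains at least one code point
// from the ranges quoted in the comment (assumed, possibly incomplete).
function containsCjkIdeographSketch(string $text): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // which is required for the \x{...} code point escapes.
    $pattern = '/[\x{3007}\x{3400}-\x{4DBF}\x{4E00}-\x{9FEF}\x{20000}-\x{2EBFF}]/u';

    return preg_match($pattern, $text) === 1;
}

var_dump(containsCjkIdeographSketch('Joomla 中文'));
var_dump(containsCjkIdeographSketch('Joomla only'));
```

A check like this could decide whether the no-split rule makes sense for a given string, but it does not by itself say where to cut.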

avatar franz-wohlkoenig franz-wohlkoenig - change - 5 Apr 2019
Status New Discussion
avatar franz-wohlkoenig franz-wohlkoenig - change - 5 Apr 2019
Labels Added: ?
avatar franz-wohlkoenig franz-wohlkoenig - labeled - 5 Apr 2019
avatar infograf768 infograf768 - change - 5 Apr 2019
Labels Added: J3 Issue
Removed: ?
avatar infograf768 infograf768 - unlabeled - 5 Apr 2019
avatar infograf768 infograf768 - labeled - 5 Apr 2019
avatar newjie
newjie - comment - 5 Apr 2019

Multibyte space is generally not used in Chinese text, not even occasionally. Even if it is used, it is not used to separate words.
How about this change:

				if($offset) $tmp = StringHelper::substr($tmp, 0, $offset + 1);
				
				// If there are no spaces and the string is longer than the maximum
				// we need to just use the ellipsis. In that case we are done.
				// if ($offset === false && strlen($text) > $length)
				// {
				// 	return '...';
				// }

IMHO this way we don't have to deal with the Chinese character ranges; we only have to accept getting half of the first word when people truncate to a length even shorter than the first word, which I think is very rare.
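
The behaviour this change aims for can be sketched as a standalone function. The sketch assumes mb_* functions in place of StringHelper, and truncateWithFallbackSketch is a hypothetical name; the real method also handles HTML, which is left out here.

```php
<?php
// Hypothetical sketch of the proposed behaviour: cut at the last space when
// there is one, otherwise fall back to a plain cut at $length.
// Assumption: mb_* functions stand in for Joomla's StringHelper.
function truncateWithFallbackSketch(string $text, int $length): string
{
    if (mb_strlen($text, 'UTF-8') <= $length) {
        return $text;
    }

    $tmp = trim(mb_substr($text, 0, $length, 'UTF-8'));

    $offset = mb_strrpos($tmp, ' ', 0, 'UTF-8');

    // Only shorten further when a space was found after the first character.
    if ($offset !== false && $offset > 0) {
        $tmp = mb_substr($tmp, 0, $offset + 1, 'UTF-8');
    }

    return rtrim($tmp) . '...';
}

echo truncateWithFallbackSketch('The quick brown fox jumps', 12), "\n";
echo truncateWithFallbackSketch('这是一个没有空格的中文句子用来测试截断', 10), "\n";
echo truncateWithFallbackSketch('Supercalifragilistic', 5), "\n";
```

The last call shows the trade-off mentioned above: a single word longer than the limit is split, because there is no space to fall back to.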

avatar franz-wohlkoenig franz-wohlkoenig - change - 7 Apr 2019
Labels Added: ?
avatar franz-wohlkoenig franz-wohlkoenig - labeled - 7 Apr 2019
avatar franz-wohlkoenig franz-wohlkoenig - change - 9 Apr 2019
Category Code style Language & Strings Language & Strings
avatar franz-wohlkoenig franz-wohlkoenig - change - 11 Apr 2019
Category Language & Strings
avatar franz-wohlkoenig franz-wohlkoenig - change - 11 Apr 2019
Labels Added: ?
avatar franz-wohlkoenig franz-wohlkoenig - labeled - 11 Apr 2019
avatar franz-wohlkoenig franz-wohlkoenig - change - 11 Apr 2019
Labels Removed: ?
avatar franz-wohlkoenig franz-wohlkoenig - unlabeled - 11 Apr 2019
avatar obuisard
obuisard - comment - 11 Mar 2020

I do think the problem comes from the strlen PHP function. When using languages like Chinese, Japanese, ... the function mb_strlen should be used instead. The problem arises because strlen counts the number of bytes, not the number of characters.
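
A short sketch of the difference, assuming the string is encoded as UTF-8 (the variable names are only for illustration):

```php
<?php
// Five Chinese characters; each takes three bytes in UTF-8.
$text = '中文字符串';

$byteCount = strlen($text);             // counts bytes
$charCount = mb_strlen($text, 'UTF-8'); // counts characters

echo $byteCount, "\n";
echo $charCount, "\n";
```

Any length comparison done with strlen on such text is therefore working with a larger number than the character count the caller passed as $length.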

avatar infograf768
infograf768 - comment - 12 Mar 2020

I guess we should first check whether the function mb_strlen is available, but isn't it already provided in J by the utf8 library:

function utf8_strlen($str){
    return mb_strlen($str);
}

Can you test changing all occurrences of strlen to utf8_strlen in public static function truncate?

avatar obuisard
obuisard - comment - 12 Mar 2020

I would actually do mb_strlen($str, 'utf8').
I am going to try and test this. Not that easy lol.
This should be applied to truncate and to truncateComplex.

Actually, looking closely at the code, truncate correctly uses StringHelper::strlen() except when testing $noSplit, where a test uses strlen instead of StringHelper::strlen(). That may be the problem for truncate.

truncateComplex never uses StringHelper::strlen().

It seems truncate and truncateComplex should be code reviewed, because I can see other places where those functions use the regular PHP functions (like strlen, strpos...) rather than the ones from StringHelper (which are guaranteed to be UTF-8 aware).

avatar obuisard
obuisard - comment - 12 Mar 2020

After doing some testing: even though UTF-8 aware functions are missing in places and should be added, this does not fix the problem with spaces mentioned by @newjie. I don't have a solution here :-(

avatar jwaisner jwaisner - change - 17 Apr 2020
Status Discussion Confirmed
avatar Quy Quy - change - 16 Feb 2022
Labels Added: No Code Attached Yet
Removed: ? ?
avatar Quy Quy - unlabeled - 16 Feb 2022
avatar brianteeman
brianteeman - comment - 27 Aug 2022

I am going to assume that there has been no change and that this is still an issue in Joomla 4

Please update the label to J4 Issue

avatar obuisard obuisard - change - 27 Aug 2022
Labels Added: J4 Issue
avatar obuisard obuisard - labeled - 27 Aug 2022
avatar Hackwar Hackwar - change - 19 Feb 2023
Labels Added: bug
avatar Hackwar Hackwar - labeled - 19 Feb 2023
avatar rdeutz rdeutz - change - 29 Apr 2024
Labels Removed: J3 Issue
avatar rdeutz rdeutz - unlabeled - 29 Apr 2024
