Use JHtmlString::truncate to truncate a Chinese string, with the parameter $noSplit set to true.
The string should be truncated correctly according to the $length parameter.
The function returns "..."
Joomla 3.9.4
I use the truncate function to get the first 140 characters of articles. If $noSplit is set to false, there is no problem. But in some cases $noSplit has to be true to make sure English texts are truncated correctly as well.
In `\libraries\cms\html\string.php`, lines 74 to 83, the comment itself states the problem:
// Find the position of the last space within the allowed length.
// If there are no spaces and the string is longer than the maximum
// we need to just use the ellipsis. In that case we are done.
Because Chinese text generally contains no spaces, line 75
$offset = StringHelper::strrpos($tmp, ' ');
returns false, and the truncate function then returns "..." at line 83.
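To make the failure easy to reproduce outside Joomla, here is a minimal sketch of the $noSplit branch described above. It is not the Joomla source: `truncateSketch` is a hypothetical name, and the `mb_*` functions (which require the mbstring extension) stand in for `StringHelper` so the snippet runs on its own.

```php
<?php
// Simplified stand-in for the $noSplit branch of truncate(); NOT the Joomla source.
// mb_* calls replace StringHelper so the snippet is self-contained (needs ext-mbstring).
function truncateSketch(string $text, int $length, bool $noSplit = true): string
{
    // Nothing to do when the text already fits.
    if ($length <= 0 || mb_strlen($text, 'UTF-8') <= $length) {
        return $text;
    }

    $tmp = mb_substr($text, 0, $length, 'UTF-8');

    if ($noSplit) {
        // Position of the last space within the allowed length, or false if none.
        $offset = mb_strrpos($tmp, ' ', 0, 'UTF-8');

        // No space at all: this logic gives up and returns only the ellipsis.
        if ($offset === false) {
            return '...';
        }

        $tmp = rtrim(mb_substr($tmp, 0, $offset, 'UTF-8'));
    }

    return $tmp . '...';
}

echo truncateSketch('The quick brown fox jumps', 12), "\n"; // The quick...
echo truncateSketch('今天天气很好我们一起去公园散步吧', 5), "\n"; // ...
```

The English sentence is cut at a word boundary, while the Chinese sentence, having no spaces, collapses to the bare ellipsis.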
Category | ⇒ | Code style Language & Strings |
Status | New | ⇒ | Discussion |
Labels | Added: J3 Issue |
Multibyte space is generally not used in Chinese text, not even occasionally. Even if it is used, it is not used to separate words.
How about this change:
if($offset) $tmp = StringHelper::substr($tmp, 0, $offset + 1);
// If there are no spaces and the string is longer than the maximum
// we need to just use the ellipsis. In that case we are done.
// if ($offset === false && strlen($text) > $length)
// {
// return '...';
// }
Imho this way we don't have to deal with Chinese character ranges. We only have to accept getting part of the first word when someone truncates to a length shorter than the first word, which I think is very rare.
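A rough, standalone sketch of that proposal follows. The name `truncateWithFallback` is hypothetical, `mb_*` functions (mbstring extension assumed) replace `StringHelper`, and the trailing space is trimmed instead of kept with `$offset + 1`, so this illustrates the idea but is not a patch for the real method.

```php
<?php
// Hypothetical helper illustrating the proposal above; not the Joomla implementation.
function truncateWithFallback(string $text, int $length): string
{
    // Nothing to do when the text already fits.
    if ($length <= 0 || mb_strlen($text, 'UTF-8') <= $length) {
        return $text;
    }

    $tmp    = mb_substr($text, 0, $length, 'UTF-8');
    $offset = mb_strrpos($tmp, ' ', 0, 'UTF-8');

    // Cut back to the last space only when one exists;
    // otherwise keep the hard cut at $length characters.
    if ($offset) {
        $tmp = mb_substr($tmp, 0, $offset, 'UTF-8');
    }

    return rtrim($tmp) . '...';
}

echo truncateWithFallback('今天天气很好我们一起去公园散步吧', 5), "\n"; // 今天天气很...
echo truncateWithFallback('The quick brown fox jumps', 12), "\n";    // The quick...
echo truncateWithFallback('Supercalifragilistic word', 5), "\n";     // Super...
```

The last call shows the trade-off mentioned above: when the requested length is shorter than the first word, the word itself is split.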
Category | Code style Language & Strings | ⇒ | Language & Strings |
Category | Language & Strings | ⇒ |
I think the problem comes from the PHP strlen function. For languages like Chinese or Japanese, mb_strlen should be used instead. The problem arises because strlen counts bytes, not characters.
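As a quick illustration of the byte/character difference, assuming the source file is saved as UTF-8 and the mbstring extension is loaded (the `$length` value is an arbitrary example):

```php
<?php
$text   = '今天天气很好'; // six Chinese characters
$length = 10;             // example maximum length, in characters

echo strlen($text), "\n";             // 18: bytes (3 per character in UTF-8)
echo mb_strlen($text, 'UTF-8'), "\n"; // 6: characters

// A byte-based check reports the text as too long; a character-based one does not.
var_dump(strlen($text) > $length);             // bool(true)
var_dump(mb_strlen($text, 'UTF-8') > $length); // bool(false)
```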
I guess we should first check if the function mb_strlen is available, but isn't it already provided in J by the utf8 library and
function utf8_strlen($str){
return mb_strlen($str);
}
Can you test changing, in public static function truncate, all occurrences of strlen to utf8_strlen?
I would actually do mb_strlen($str, 'utf8').
I am going to try and test this. Not that easy lol.
This should be applied to truncate and to truncateComplex.
Actually, looking closely at the code, truncate correctly uses StringHelper::strlen() except when testing $noSplit, where a test uses strlen instead of StringHelper::strlen(). That may be the problem for truncate.
truncateComplex never uses StringHelper::strlen().
It seems like truncate and truncateComplex should be code reviewed, because I can see other places where those functions use the regular PHP functions (like strlen, strpos...) rather than the ones from StringHelper (which guarantees they are utf-8 aware).
Status | Discussion | ⇒ | Confirmed |
Labels | Added: No Code Attached Yet |
I am going to assume that there has been no change and that this is still an issue in Joomla 4.
Please update the label to J4 Issue.
Labels | Added: J4 Issue |
Labels | Added: bug |
Labels | Removed: J3 Issue |
Chinese may also use the multibyte space
「 」
We could handle this case in the code, but it would not solve the issue when there is no space at all.
The only way would be to detect the range of Chinese multibyte characters in general in the string concerned. But this is very complex, as there are many possible blocks.
Range should be something like
[0x3007,0x3007],[0x3400,0x4DBF],[0x4E00,0x9FEF],[0x20000,0x2EBFF]
( https://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode )
See also:
https://zh.wikipedia.org/wiki/%E4%B8%AD%E6%97%A5%E9%9F%93%E7%9B%B8%E5%AE%B9%E8%A1%A8%E6%84%8F%E6%96%87%E5%AD%97
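For what it's worth, a range check like the one described above could be sketched with a Unicode-aware regular expression. The helper name `containsCjkIdeograph` is hypothetical, and the ranges are simply the ones quoted above, so they should be treated as illustrative and not as an exhaustive list.

```php
<?php
// Hypothetical helper; ranges copied from the comment above, not an exhaustive list.
function containsCjkIdeograph(string $text): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // which allows \x{...} code points above 0xFF.
    $pattern = '/[\x{3007}\x{3400}-\x{4DBF}\x{4E00}-\x{9FEF}\x{20000}-\x{2EBFF}]/u';

    return preg_match($pattern, $text) === 1;
}

var_dump(containsCjkIdeograph('今天天气很好')); // bool(true)
var_dump(containsCjkIdeograph('Hello world'));  // bool(false)
```

PCRE also offers the `\p{Han}` script class, which may be simpler than maintaining explicit ranges, though whether it matches exactly the same set of characters would need checking.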