User tests: Successful: Unsuccessful:
This PR improves the TRIM filter-method. trim() itself already casts the result to string, so we don't need the casting in the first line. The second and third call can be merged with the default character mask from trim() to one call and we can remove the double 0x80 character in the second character mask, since it will remove all 0x80 from beginning and end, regardless of it being one, two, three or 50 of those characters.
This is hardly testable. It will most likely require a merge by review. I stumbled over this while doing some codereviewing myself.
Labels |
Added:
?
|
trim() applies itself to the string as long as there is something to be stripped from the string. trim() also does not seem to be UTF8 safe. It works on single-byte-characters. That said, the second parameter for trim() is a character mask (= a string) and every character from that string is removed from the beginning or end of a string. So trim(trim($string, "\t"), "\n")
is the same as trim($string, "\t\n")
. trim() is NOT str_replace() and replaces the content of the second parameter of trim inside the string with nothing, but it splits up the second parameter into single characters. From the documentation of PHP: trim('abc', 'bad')
results in the value "c" being returned, because first "a" and then "b" is being stripped.
Actually, trim(trim($string, "\t"), "\n")
is NOT the same as trim($string, "\t\n")
. If $string = "\n\t\n" then the first would result in $string = "\t", while the later would result in $string = "". Likewise with the three calls to trim() that we have in the code right now. If the input would be "\xA0 \xA0" the result after sending it through that filter would be " ", while it should be "".
I know.
This is why deleting xC2 will create an issue if the following chr is NOT xA0, but for example xA3
FYI, chr(0xc2).chr(0xA3) is the Pound glyph, i.e. ₤
Again, trim() is not UTF8 safe. The current PR is UTF8 aware.
[19-Jan-2015 10:35:14 UTC] PHP Warning: preg_replace(): No ending delimiter '/' found in /Applications/MAMP/htdocs/stagingcms/libraries/vendor/joomla/string/src/phputf8/trim.php on line 45
TESTING instructions:
Enter some content in the Created by alias field, Publishing tab when editing an article;
The content should include some spaces:
multibyte space:
「 」
non-breaking:
「 」
normal space
「 」
₤ pound sign
before and after the text
I'm using the method like it is supposed to be used. If it doesn't work and throw an error, then I would point at our JString implementation or rather our third party UTF8 class. Since the class is not supported anymore anyway, we should maybe look at the whole utf8 string stuff a bit closer...
This demonstrates again that tests are always necessary...
Therefore, with what we have now, the only issue is when we have multiple types of spaces not in the same order as our code.
Yes, this demonstrates that tests are necessary. So the original author of this code should have provided tests. That did not happen. The issue is still that trim() is not UTF8-safe and thus will break the pound sign if it is the first or last character in the string. This still means that the original implementation is broken and since the JString::trim() actually does NOT work properly with UTF8 strings, we are screwed. The result of that is that we either drop stripping UTF8 spaces altogether or you need an expert that will write a method that properly removes those multibyte spaces. That however is a much more complex thing and can still not be done by simply writing trim() 3 times around the same string.
No, it currently does not work. And my PR does not fix it, since our codebase (JString::trim() to be exact) does not work with multibyte chars in the charactermap.
I will close this PR and open a simple issue for this. Let someone else figure this out and fight over this.
Status | Pending | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2015-01-19 11:17:37 |
It HAS to be tested....
The original code made it clear what was trimmed (normal space (and related usual stuff *), then multibyte space, then non-breaking space. I do not see where is the improvement here.
And what about other UTF8 characters starting for example by xC2 and not followed by xA0 ?