Joomla! Issue Tracker | Joomla! CMS #5816 - Optimizing TRIM in JFilterInput

Closed
19 Jan 2015
Medium
Build: staging
# 5816
Diff
Hackwar:patch-17

Success

Success The Travis CI build passed Details

User tests: Successful: Unsuccessful:

Hackwar
19 Jan 2015

This PR improves the TRIM filter-method. trim() itself already casts the result to string, so we don't need the casting in the first line. The second and third call can be merged with the default character mask from trim() to one call and we can remove the double 0x80 character in the second character mask, since it will remove all 0x80 from beginning and end, regardless of it being one, two, three or 50 of those characters.

This is hardly testable. It will most likely require a merge by review. I stumbled over this while doing some codereviewing myself.

b8aa3c7 19 Jan 2015

Optimizing TRIM in JFilterInput

Hackwar - open - 19 Jan 2015

jissues-bot - change - 19 Jan 2015

Labels

Added: ?

infograf768 - comment - 19 Jan 2015

It HAS to be tested....
The original code made it clear what was trimmed (normal space (and related usual stuff *), then multibyte space, then non-breaking space. I do not see where is the improvement here.

    " " (ASCII 32 (0x20)), an ordinary space.
    "\t" (ASCII 9 (0x09)), a tab.
    "\n" (ASCII 10 (0x0A)), a new line (line feed).
    "\r" (ASCII 13 (0x0D)), a carriage return.
    "\0" (ASCII 0 (0x00)), the NUL-byte.
    "\x0B" (ASCII 11 (0x0B)), a vertical tab.

And what about other UTF8 characters starting for example by xC2 and not followed by xA0 ?

Hackwar - comment - 19 Jan 2015

trim() applies itself to the string as long as there is something to be stripped from the string. trim() also does not seem to be UTF8 safe. It works on single-byte-characters. That said, the second parameter for trim() is a character mask (= a string) and every character from that string is removed from the beginning or end of a string. So trim(trim($string, "\t"), "\n") is the same as trim($string, "\t\n"). trim() is NOT str_replace() and replaces the content of the second parameter of trim inside the string with nothing, but it splits up the second parameter into single characters. From the documentation of PHP: trim('abc', 'bad') results in the value "c" being returned, because first "a" and then "b" is being stripped.

Hackwar - comment - 19 Jan 2015

Actually, trim(trim($string, "\t"), "\n") is NOT the same as trim($string, "\t\n"). If $string = "\n\t\n" then the first would result in $string = "\t", while the later would result in $string = "". Likewise with the three calls to trim() that we have in the code right now. If the input would be "\xA0 \xA0" the result after sending it through that filter would be " ", while it should be "".

infograf768 - comment - 19 Jan 2015

I know.
This is why deleting xC2 will create an issue if the following chr is NOT xA0, but for example xA3

FYI, chr(0xc2).chr(0xA3) is the Pound glyph, i.e. ₤

bf93181 19 Jan 2015

Fixing encoding issues

Hackwar - comment - 19 Jan 2015

Again, trim() is not UTF8 safe. The current PR is UTF8 aware.

infograf768 - comment - 19 Jan 2015

[19-Jan-2015 10:35:14 UTC] PHP Warning: preg_replace(): No ending delimiter '/' found in /Applications/MAMP/htdocs/stagingcms/libraries/vendor/joomla/string/src/phputf8/trim.php on line 45

infograf768 - comment - 19 Jan 2015

TESTING instructions:
Enter some content in the Created by alias field, Publishing tab when editing an article;
The content should include some spaces:

multibyte space:
「　」

non-breaking:
「 」

normal space
「 」

₤ pound sign

before and after the text

Hackwar - comment - 19 Jan 2015

I'm using the method like it is supposed to be used. If it doesn't work and throw an error, then I would point at our JString implementation or rather our third party UTF8 class. Since the class is not supported anymore anyway, we should maybe look at the whole utf8 string stuff a bit closer...

infograf768 - comment - 19 Jan 2015

This demonstrates again that tests are always necessary...
Therefore, with what we have now, the only issue is when we have multiple types of spaces not in the same order as our code.

Hackwar - comment - 19 Jan 2015

Yes, this demonstrates that tests are necessary. So the original author of this code should have provided tests. That did not happen. The issue is still that trim() is not UTF8-safe and thus will break the pound sign if it is the first or last character in the string. This still means that the original implementation is broken and since the JString::trim() actually does NOT work properly with UTF8 strings, we are screwed. The result of that is that we either drop stripping UTF8 spaces altogether or you need an expert that will write a method that properly removes those multibyte spaces. That however is a much more complex thing and can still not be done by simply writing trim() 3 times around the same string.

Bakual - comment - 19 Jan 2015

So currently it works and this PR breaks it?
@hackwar Do you want to fix your PR or close it as not worth the effort?

Hackwar - comment - 19 Jan 2015

No, it currently does not work. And my PR does not fix it, since our codebase (JString::trim() to be exact) does not work with multibyte chars in the charactermap.

infograf768 - comment - 19 Jan 2015

@Bakual

Therefore, with what we have now, the only issue is when we have multiple types of spaces not in the same order as our code.

I do NOT confirm that the pound sign is deleted with the present code.

Hackwar - close - 19 Jan 2015

Hackwar - comment - 19 Jan 2015

I will close this PR and open a simple issue for this. Let someone else figure this out and fight over this.

Hackwar - close - 19 Jan 2015

Hackwar - change - 19 Jan 2015

Status	Pending	⇒	Closed
Closed_Date	0000-00-00 00:00:00	⇒	2015-01-19 11:17:37

Hackwar - head_ref_deleted - 6 Jan 2016

Add a Comment

Older
Newer

Joomla! Issue Tracker - CMS

[#5816] - Optimizing TRIM in JFilterInput

Add a Comment