As described in #5816, the TRIM filter in JForm is not working properly. A string that is surrounded by first normal spaces and then multi-byte-spaces will potentially only have the multi-byte-spaces removed, if at all, and the normal spaces will be left intact. Since trim() is not UTF8-safe, this will also break a bunch of UTF8 characters. Note: JString::trim() does not work with multi-byte-charactermaps either. Have fun with fixing this.
Labels |
Added:
?
|
No
Category | ⇒ | Libraries |
A string that is surrounded by first normal spaces and then multi-byte-spaces will potentially only have the multi-byte-spaces removed
Wrong:
if we have "normal space"+"multibyte space"+string: both spaces will be removed OK as they are in the right order.
It is the opposite which would not work, i.e.
"multibyte space"+"normal space"+string where the normal space would not be removed.
Code is:
case 'TRIM':
$result = (string) trim($source); // normal space
$result = trim($result, chr(0xE3) . chr(0x80) . chr(0x80)); // multibyte space
$result = trim($result, chr(0xC2) . chr(0xA0)); // non-breaking space
break;
Which is what I wrote.
I guess "first" and "then" do not mean the same for you and me... :)
"surrounded by" normally means, that I first wrap it in one and then the other.
As you like, my friend. Will not fight for this. Just knowing where is the issue is enough.
What I wonder: Imho the purpose of this method is to remove whitespaces which are accidentally added by the user. It's only some convenient feature.
Or is there a place where we use this where there would something break if there is a whitespace at the beginning or end?
If not, why are we making our life harder than needed? We don't need to take care of any order of multibyte, nonbreaking and regular spaces.
If the purpose is to remove accidentally added spaces, chances are very low that the user added different kinds of those spaces. Most likely he added one or more of the same type. So we can just ignore the mixed case if that makes the code simpler.
I think I got it:
case 'TRIM':
$result = (string) trim($source);
$result = trim(trim(trim($result, chr(0xC2) . chr(0xA0)), chr(0xE3) . chr(0x80) . chr(0x80)));
$result = trim(trim(trim($result, chr(0xE3) . chr(0x80) . chr(0x80)), chr(0xC2) . chr(0xA0)));
break;
No, you did not get it. trim() is still not UTF8-safe. It will still remove any order of 0xE3 and/or 0x80 from the beginning and end of the string and no matter how often you wrap a trim() in another trim() will change that.
Regarding different spaces: Copy a text from word into a textbox and you got the situation that you might have different-style spaces.
trim() simply does NOT understand UTF8. It only sees one-byte-characters, which will rip your 3-byte-UTF8-spaces appart. It does not see 0xE38080, but three separate characters that we want to have removed from beginning and end of a string. That is why we have a JString::trim(). However, even that JString::trim() can only make sure that you do not rip apart UTF8 chars in the target to trim. It does not support UTF8 chars in the character map.
by using the group chr(0x##) . chr(0x##)
(and not \0x##\0x##
) we do not take off a single 0x## as far as I know. This group is considered as a unique sequence.
We could use
case 'TRIM':
$result = (string) trim($source);
$result = trim(JString::trim(JString::trim($result, chr(0xC2) . chr(0xA0)), chr(0xE3) . chr(0x80) . chr(0x80)));
$result = trim(JString::trim(JString::trim($result, chr(0xE3) . chr(0x80) . chr(0x80)), chr(0xC2) . chr(0xA0)));
break;
Looks like working here.
not sure whether it fit, but in one of my project I used preg_replace( '/[^[:print:]]/', '',$string);
(from http://stackoverflow.com/a/8171868/1000711 ) to similar purpose
At what I read there, that would strip everything except basic printable ascii, which is not the purpose of TRIM here.
No, those 2 things are creating exactly the same thing. And each chr() creates one byte, the only thing that trim understands. Trim does not understand multibyte (=UTF8) chars.
Von Samsung Mobile gesendet
-------- Ursprüngliche Nachricht --------
Von: infograf768 notifications@github.com
Datum:
An: joomla/joomla-cms joomla-cms@noreply.github.com
Cc: Hannes Papenberg info@joomlager.de
Betreff: Re: [joomla-cms] JForm: TRIM filtering does not work properly and is not UTF8-safe (#5824)
by using the group chr(0x##) . chr(0x##) (and not \0x##\0x##) we do not take off a single 0x## as far as I know. This group is considered as a unique sequence.
—
Reply to this email directly or view it on GitHub.
Please read my last responses. It does not work.
Von Samsung Mobile gesendet
-------- Ursprüngliche Nachricht --------
Von: infograf768 notifications@github.com
Datum:
An: joomla/joomla-cms joomla-cms@noreply.github.com
Cc: Hannes Papenberg info@joomlager.de
Betreff: Re: [joomla-cms] JForm: TRIM filtering does not work properly and is not UTF8-safe (#5824)
We could use
case 'TRIM':
$result = (string) trim($source);
$result = trim(JString::trim(JString::trim($result, chr(0xC2) . chr(0xA0)), chr(0xE3) . chr(0x80) . chr(0x80)));
$result = trim(JString::trim(JString::trim($result, chr(0xE3) . chr(0x80) . chr(0x80)), chr(0xC2) . chr(0xA0)));
break;
Looks like working here.
—
Reply to this email directly or view it on GitHub.
Regarding different spaces: Copy a text from word into a textbox and you got the situation that you might have different-style spaces.
Yes, that may be. But we don't use this method to trim textareas where you may copy stuff from word. We use it in some places to trim the author, created_by_alias, username, email and such fields. That's why I wondered.
If we can fix it easy, then it's good to do. But if it means introducing very complex code, then I would say we can live with a simpler approach which doesn't take into account mixed things.
We should start with writing test, then we have description what is the correct behaviour and can code a solution.
I don't see a solution that works properly. The only thing that you can use is trim() without any parameters and not filter on multi-byte-spaces. I mean, there are other ways, but those are so complex, that it simply is not worth the effort.
The TRIM was created specially to take care of nonbreaking spaces AND multibytes spaces. It was created, as @Bakual explains above, to take care of a possible error when entering one of these or a normal space (all these spaces being undesirable) before the desired string.
I consider it is worth the effort to improve it and I think you mistake when you refuse to consider the sequence principle in this code.
Ok, if you want to really make this work, you would have to use something like this:
case 'TRIM':
$cur_val = (string) $source
$whitespace = array(' ', "\t", "\n", "\r", "\0", "\x0B", "\xE3\x80\x80", "\xC2\xA0");
do {
$prev_val = $cur_val;
if (in_array(JString::substr($cur_val, 0, 1), $whitespace))
{
$cur_val = JString::substr($cur_val, 1);
}
if (in_array(JString::substr($cur_val, JString::strlen($cur_val) - 1, 1), $whitespace))
{
$cur_val = JString::substr($cur_val, 0, JString::strlen($cur_val) - 1);
}
} while ($prev_val != $cur_val);
$result = $cur_val;
break;
This cycles through the string as long as there is an unwanted char at the beginning or end, it understands multibyte chars and it is UTF8-safe, from what I can tell. All that without a guarantee for correctness and with the potential that I miscounted the indexes. It is going to be slow as hell as well.
But if you want to remove any multi-byte character from a string, you can forget trim() and also our supposed multi-byte-character-safe solution in JString.
Your code works fine here (after adding a ; at $cur_val = (string) $source;
I tested by adding multiple types of space before and after the desired string.
I do no think that speed is an issue here.
Can you make a new PR?
@rdeutz
The possibilites are infinite as we have to test multiples as well as order of the spaces.
Here are the spaces:
multibyte space:
「 」
non-breaking:
「 」
normal space
「 」
This code is so ugly, that I personally don't want to see it in the Joomla codebase. If you still want to have this, feel free to propose a PR, but I expect this to be backed by proper unittests.
Imho, it should then be in the JString::trim method and not in the JFilterInput class.
JFilterInput should be changed to use the JString for the trimming then like originally suggested here.
JString::trim() is based on String::trim(), which again is based on a third party library. So you would have to send this change upstream to that developer and expect that developer to agree with you on the implementation. Considering that I couldn't find the original developer of that library, I'd say good luck.
More so, my implementation works for exactly THIS one scenario. It most likely does not work for others. It is also not a drop-in replacement for normal trim, since it does not expect a character map, but an array of (multi-byte) characters.
As I said before, this code is so ugly, that I already regret writing it. I will not be made responsible for any bugs resulting from this. And of course I could think of an implementation that takes a multi-byte character map so that we mimick trim() completely, but that could ultimately run into a DoS attack vulnerability. The code that I've written is slow. With a significant large input, you might be able to let the server crunch on this input for quite some time.
Could we use https://github.com/danielstjules/Stringy (maybe with some proxies, to be 100% B/C)?
As far as I can tell it handles multibyte UTF8
No, and not only because his Stringy::trim() is simply a wrapper that castrates the original trim() by not allowing to hand over a charactermap, but also because I don't want us to implement a 1400 lines class for simply having trim(). You can find his trim() around line 1000 and as you can see, it is simply trim and it does even less than the PHP-builtin trim(). He wrote a class that shows how strings should be done in PHP, but I'd rather wait until PHP itself adopts this.
Hi you created this issue sometime ago but have not provided any code for people to evaluate. As no one else has shown any interest in providing the code and you have not then I am closing this issue at this time. If code is provided (a pull request) it can always be re-examined.
Status | New | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2016-05-08 15:39:51 |
Closed_By | ⇒ | brianteeman |
Could we us this code? http://pageconfig.com/post/remove-undesired-characters-with-trim_all-php