Joomla! Issue Tracker | Joomla! CMS #5824 - JForm: TRIM filtering does not work properly and is not UTF8-safe

Closed
8 May 2016
Medium
Build: master
# 5824

Hackwar
19 Jan 2015

As described in #5816, the TRIM filter in JForm is not working properly. A string that is surrounded by first normal spaces and then multi-byte-spaces will potentially only have the multi-byte-spaces removed, if at all, and the normal spaces will be left intact. Since trim() is not UTF8-safe, this will also break a bunch of UTF8 characters. Note: JString::trim() does not work with multi-byte-charactermaps either. Have fun with fixing this.

Hackwar - open - 19 Jan 2015

jissues-bot - change - 19 Jan 2015

Labels

Added: ?

bcordis - comment - 19 Jan 2015

Could we us this code? http://pageconfig.com/post/remove-undesired-characters-with-trim_all-php

Hackwar - comment - 19 Jan 2015

zero-24 - change - 19 Jan 2015

Category

⇒

Libraries

infograf768 - comment - 20 Jan 2015

A string that is surrounded by first normal spaces and then multi-byte-spaces will potentially only have the multi-byte-spaces removed

Wrong:
if we have "normal space"+"multibyte space"+string: both spaces will be removed OK as they are in the right order.
It is the opposite which would not work, i.e.
"multibyte space"+"normal space"+string where the normal space would not be removed.

Code is:

case 'TRIM':
                $result = (string) trim($source); // normal space
                $result = trim($result, chr(0xE3) . chr(0x80) . chr(0x80)); // multibyte space
                $result = trim($result, chr(0xC2) . chr(0xA0)); // non-breaking space
                break;

Hackwar - comment - 20 Jan 2015

Which is what I wrote.

infograf768 - comment - 20 Jan 2015

I guess "first" and "then" do not mean the same for you and me... :)

Hackwar - comment - 20 Jan 2015

"surrounded by" normally means, that I first wrap it in one and then the other.

infograf768 - comment - 20 Jan 2015

As you like, my friend. Will not fight for this. Just knowing where is the issue is enough.

Bakual - comment - 20 Jan 2015

What I wonder: Imho the purpose of this method is to remove whitespaces which are accidentally added by the user. It's only some convenient feature.
Or is there a place where we use this where there would something break if there is a whitespace at the beginning or end?
If not, why are we making our life harder than needed? We don't need to take care of any order of multibyte, nonbreaking and regular spaces.
If the purpose is to remove accidentally added spaces, chances are very low that the user added different kinds of those spaces. Most likely he added one or more of the same type. So we can just ignore the mixed case if that makes the code simpler.

infograf768 - comment - 20 Jan 2015

@Bakual
You are right, but nevertheless we should imho make this perfect if we can.
I am working on it right now.

infograf768 - comment - 20 Jan 2015

I think I got it:

            case 'TRIM':
                $result = (string) trim($source);
                $result = trim(trim(trim($result, chr(0xC2) . chr(0xA0)), chr(0xE3) . chr(0x80) . chr(0x80)));
                $result = trim(trim(trim($result, chr(0xE3) . chr(0x80) . chr(0x80)), chr(0xC2) . chr(0xA0)));
                break;

Hackwar - comment - 20 Jan 2015

No, you did not get it. trim() is still not UTF8-safe. It will still remove any order of 0xE3 and/or 0x80 from the beginning and end of the string and no matter how often you wrap a trim() in another trim() will change that.

Regarding different spaces: Copy a text from word into a textbox and you got the situation that you might have different-style spaces.

Hackwar - comment - 20 Jan 2015

trim() simply does NOT understand UTF8. It only sees one-byte-characters, which will rip your 3-byte-UTF8-spaces appart. It does not see 0xE38080, but three separate characters that we want to have removed from beginning and end of a string. That is why we have a JString::trim(). However, even that JString::trim() can only make sure that you do not rip apart UTF8 chars in the target to trim. It does not support UTF8 chars in the character map.

Fedik - comment - 20 Jan 2015

@Bakual as @Hackwar already mentioned , user can copy text from somewhere and there can be more than just white space

infograf768 - comment - 20 Jan 2015

by using the group chr(0x##) . chr(0x##) (and not \0x##\0x##) we do not take off a single 0x## as far as I know. This group is considered as a unique sequence.

infograf768 - comment - 20 Jan 2015

We could use

            case 'TRIM':
                $result = (string) trim($source);
                $result = trim(JString::trim(JString::trim($result, chr(0xC2) . chr(0xA0)), chr(0xE3) . chr(0x80) . chr(0x80)));
                $result = trim(JString::trim(JString::trim($result, chr(0xE3) . chr(0x80) . chr(0x80)), chr(0xC2) . chr(0xA0)));
                break;

Looks like working here.

Fedik - comment - 20 Jan 2015

not sure whether it fit, but in one of my project I used preg_replace( '/[^[:print:]]/', '',$string); (from http://stackoverflow.com/a/8171868/1000711 ) to similar purpose

infograf768 - comment - 20 Jan 2015

At what I read there, that would strip everything except basic printable ascii, which is not the purpose of TRIM here.

Hackwar - comment - 20 Jan 2015

No, those 2 things are creating exactly the same thing. And each chr() creates one byte, the only thing that trim understands. Trim does not understand multibyte (=UTF8) chars.

Von Samsung Mobile gesendet

-------- Ursprüngliche Nachricht --------
Von: infograf768 notifications@github.com
Datum:

An: joomla/joomla-cms joomla-cms@noreply.github.com
Cc: Hannes Papenberg info@joomlager.de
Betreff: Re: [joomla-cms] JForm: TRIM filtering does not work properly and is not UTF8-safe (#5824)

by using the group chr(0x##) . chr(0x##) (and not \0x##\0x##) we do not take off a single 0x## as far as I know. This group is considered as a unique sequence.

—
Reply to this email directly or view it on GitHub.

Hackwar - comment - 20 Jan 2015

Please read my last responses. It does not work.

Von Samsung Mobile gesendet

We could use

        case 'TRIM':
            $result = (string) trim($source);
            $result = trim(JString::trim(JString::trim($result, chr(0xC2) . chr(0xA0)), chr(0xE3) . chr(0x80) . chr(0x80)));
            $result = trim(JString::trim(JString::trim($result, chr(0xE3) . chr(0x80) . chr(0x80)), chr(0xC2) . chr(0xA0)));
            break;

Looks like working here.

—
Reply to this email directly or view it on GitHub.

Bakual - comment - 20 Jan 2015

Regarding different spaces: Copy a text from word into a textbox and you got the situation that you might have different-style spaces.

Yes, that may be. But we don't use this method to trim textareas where you may copy stuff from word. We use it in some places to trim the author, created_by_alias, username, email and such fields. That's why I wondered.
If we can fix it easy, then it's good to do. But if it means introducing very complex code, then I would say we can live with a simpler approach which doesn't take into account mixed things.

rdeutz - comment - 20 Jan 2015

We should start with writing test, then we have description what is the correct behaviour and can code a solution.

Hackwar - comment - 20 Jan 2015

I don't see a solution that works properly. The only thing that you can use is trim() without any parameters and not filter on multi-byte-spaces. I mean, there are other ways, but those are so complex, that it simply is not worth the effort.

infograf768 - comment - 20 Jan 2015

The TRIM was created specially to take care of nonbreaking spaces AND multibytes spaces. It was created, as @Bakual explains above, to take care of a possible error when entering one of these or a normal space (all these spaces being undesirable) before the desired string.
I consider it is worth the effort to improve it and I think you mistake when you refuse to consider the sequence principle in this code.

Hackwar - comment - 20 Jan 2015

Ok, if you want to really make this work, you would have to use something like this:

case 'TRIM':
$cur_val = (string) $source
$whitespace = array(' ', "\t", "\n", "\r", "\0", "\x0B", "\xE3\x80\x80", "\xC2\xA0");
do {
$prev_val = $cur_val;
if (in_array(JString::substr($cur_val, 0, 1), $whitespace))
{
$cur_val = JString::substr($cur_val, 1);
}

if (in_array(JString::substr($cur_val, JString::strlen($cur_val) - 1, 1), $whitespace))
{
$cur_val = JString::substr($cur_val, 0, JString::strlen($cur_val) - 1);
}
} while ($prev_val != $cur_val);
$result = $cur_val;
break;

This cycles through the string as long as there is an unwanted char at the beginning or end, it understands multibyte chars and it is UTF8-safe, from what I can tell. All that without a guarantee for correctness and with the potential that I miscounted the indexes. It is going to be slow as hell as well.

But if you want to remove any multi-byte character from a string, you can forget trim() and also our supposed multi-byte-character-safe solution in JString.

infograf768 - comment - 20 Jan 2015

Your code works fine here (after adding a ; at $cur_val = (string) $source;
I tested by adding multiple types of space before and after the desired string.
I do no think that speed is an issue here.
Can you make a new PR?

@rdeutz
The possibilites are infinite as we have to test multiples as well as order of the spaces.

Here are the spaces:
multibyte space:
「　」

non-breaking:
「」

normal space
「」

infograf768 - comment - 22 Jan 2015

@Hackwar
Will you propose a new PR or shall I do it myself with your code?

Hackwar - comment - 22 Jan 2015

This code is so ugly, that I personally don't want to see it in the Joomla codebase. If you still want to have this, feel free to propose a PR, but I expect this to be backed by proper unittests.

Bakual - comment - 22 Jan 2015

Imho, it should then be in the JString::trim method and not in the JFilterInput class.
JFilterInput should be changed to use the JString for the trimming then like originally suggested here.

Hackwar - comment - 22 Jan 2015

JString::trim() is based on String::trim(), which again is based on a third party library. So you would have to send this change upstream to that developer and expect that developer to agree with you on the implementation. Considering that I couldn't find the original developer of that library, I'd say good luck.

More so, my implementation works for exactly THIS one scenario. It most likely does not work for others. It is also not a drop-in replacement for normal trim, since it does not expect a character map, but an array of (multi-byte) characters.

As I said before, this code is so ugly, that I already regret writing it. I will not be made responsible for any bugs resulting from this. And of course I could think of an implementation that takes a multi-byte character map so that we mimick trim() completely, but that could ultimately run into a DoS attack vulnerability. The code that I've written is slow. With a significant large input, you might be able to let the server crunch on this input for quite some time.

dgt41 - comment - 24 Jan 2015

Could we use https://github.com/danielstjules/Stringy (maybe with some proxies, to be 100% B/C)?
As far as I can tell it handles multibyte UTF8

Hackwar - comment - 24 Jan 2015

No, and not only because his Stringy::trim() is simply a wrapper that castrates the original trim() by not allowing to hand over a charactermap, but also because I don't want us to implement a 1400 lines class for simply having trim(). You can find his trim() around line 1000 and as you can see, it is simply trim and it does even less than the PHP-builtin trim(). He wrote a class that shows how strings should be done in PHP, but I'd rather wait until PHP itself adopts this.

brianteeman - comment - 8 May 2016

Hi you created this issue sometime ago but have not provided any code for people to evaluate. As no one else has shown any interest in providing the code and you have not then I am closing this issue at this time. If code is provided (a pull request) it can always be re-examined.

_{This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/5824.}

brianteeman - change - 8 May 2016

Status	New	⇒	Closed
Closed_Date	0000-00-00 00:00:00	⇒	2016-05-08 15:39:51
Closed_By		⇒	brianteeman

brianteeman - close - 8 May 2016

Add a Comment

Older
Newer

Joomla! Issue Tracker - CMS

[#5824] - JForm: TRIM filtering does not work properly and is not UTF8-safe

Add a Comment