User tests: Successful: Unsuccessful:
See #12049
This PR takes advantage of the PHP extension intl
when it is enabled to use the transliterator_transliterate()
method, itself using ICU
library.
The php extension is available since php 5.4.0, but may not be enabled on some hosts.
If disabled, former behavior is used which does not accept to upload files with UTF8 names and displays an error. This PR is therefore totally BC.
Using iconv
and IGNORE
let's get rid of some prime-characters that can't be transliterated, like the Cyrillic letter ь
.
Check in System Information => PHP Information that the extension is enabled:
You should get something like this:
(It is more likely to be enabled by default when using php 5.5.x or higher).
If it is not enabled on your local environment, try to modify your PHP.ini.
Change an image file name to utf8.
I used here '完 成'".png
to be sure we also get rid of unwanted characters.
Go to Media Manager =>Upload, select the file and click upload.
In the case above I get a pure ascii name wan-cheng.png
:
Test with other utf8 fine names.
NOTE: J4 may use a similar code.
Status | New | ⇒ | Pending |
Category | ⇒ | Administration com_media |
What will happen if I move a site from a server that does use this php function to a server that does not?
Nothing, as the files already uploaded are OK with their ascii names.
New uploads of files with utf8 names will display the usual error.
Just i still think this new code should go inside makeSafe() !
so that benefit will be everywhere that makeSafe() is used
I was not sure we could do that with unwanted consequences. Therefore why I limited it to com_media.
@joomla/cms-maintainers
What do you think?
I can modify easily.
I was not sure we could do that with unwanted consequences.
What kind of consequences do you expect by moving that code?
Will move the code to makesafe()
Labels |
Added:
?
|
Category | Administration com_media | ⇒ | Libraries |
Title |
|
Title |
|
Tested with different filenames. German, French, Swedish, Czech, Greek, Bulgarian, Chinese.
Personally I would prefer if only one hyphen is used between characters and no hyphen at the end of the name.
e.g. Åländsk höna.jpg (5 blanks) => alandsk-----hona.jpg
Underscore seems to be an allowed character. Is it? And in this case no hyphen was added.
# % & { } $ ! @ _.jpg (blanks between the characters) => _.jpg
Edit: The first character made a header of this line.
As I think the underscore should be transformed to a hyphen and no hyphen at the end of the file name, this one should fail.
I'll make some more tests with disallowed characters but I find this only as details and do not really influence the successful test.
To testers: Check if intl is activated as described above.
On my XAMPP system I had to activate the intl extension in php.ini
Look for the line
;extension=php_intl.dll
Remove the semicolon, save the file and restart the server.
I have tested this item
With intl enabled i get transliteration + existing behaviour
Without intl enabled i get existing behaviour
about:
As I think the underscore should be transformed to a hyphen and no hyphen at the end of the file name, this one should fail.
underscore is a valid character, let's not add such a replacement
I have tested this item
I now agree. Keep character tests as is. What I wrote is too opinioned.
Status | Pending | ⇒ | Ready to Commit |
RTC after two successful tests.
Milestone |
Added: |
i still have doubts about introducing functionality that will not exist for over 10% of the users
Always so positive... even if it was only 10% (where do you take this figure from?) it will help those concerned.
See btw https://w3techs.com/technologies/details/pl-php/5/all = 51% use php >=5.4
php 7 getting in slowly https://w3techs.com/technologies/details/pl-php/7/all
Already shows the extension is present on all these servers. It may or may not be enabled. Impossible to know... I know it is enabled by default on my share host which runs 5.6.30
In a discussion with @mbabker in maintainers chat, he proposed we add in j4 the symphony INTL as a J library, which will make this "new" functionality independent from the php extension itself.
Always so positive... even if it was only 10% (where do you take this figure from?) it will help those concerned.
From the recorded Joomla stats https://developer.joomla.org/about/stats.html
Over the years there have been lots of great proposals that couldn't move forward because of the php version that we support.
We do not read these stats with the same glasses. They just say that 5.3 is used by 13% and all others >=5.4 therefore a possible share of 87%.
The stats do not check for now if a specific function exists. And, anyway, this PR is totally B/C.
Thanks for your concern.
For someone who cares so much about B/C thats a disappointing viewpoint
@brianteeman mate you are wrong here! What @infograf768 is doing here is totally B/C and also is progressive enhancement. I can't see what more can be done in a sensible manner here (given that coding for an unsupported version of PHP is just a waste of time).
The code does not break for PHP 5.3, so it is B/C enough
They just say that 5.3 is used by 13% and all others >=5.4 therefore a possible share of 87%.
possibly you could subtract those sites using mostly latin-based filenames, english, french, etc, because these already have little problems, since they get their filename proper already with just the accented characters removed
And finally users with of non-latin based languages, will be able to both
what is currently happening with these filenames really makes Joomla look bad
We've been very careful to never add functionality that results in different behaviors on different PHP versions. This is crossing that line (hell I'm not even comfortable with my last libsodium PR because it isn't available on PHP 5.3). There really needs to be a majorly strong supporting use case to break this philosophy now, especially as we are slowing down development on the 3.x branch and begin to focus on stability for our current definition of "long term support".
thank you for expressing more eloquently what I tried to say
possibly you could subtract those sites using mostly latin-based filenames, english, french, etc, because these already have little problems, since they get their filename proper already with just the accented characters removed
You can't really subtract Latin filenames, only the ASCII characters (English). For others you often or mostly get unreadable filenames as the result.
Can we please remove the RTC label -
Can we please remove the RTC label -
I don't see why. Release leaders may decide to merge or not, not you.
@mbabker
As it is totally B/C, I see no reason to not include this in the next 3.x version.
It will save time for many people who have the extension present (as it will avoid time loss for those who want to use transliteration instead of unicode slugs to get auto-ascii aliases) and 3.x is going to be available for 2 years more! It does not cost a cent to get it in.
It is not about BC but exactly what the release leader for the next release has stated.
We've been very careful to never add functionality that results in different behaviors on different PHP versions
The behavior is NOT different fundamentaly, it just saves time for people who are lucky enough to have that extension loaded.
As it is totally B/C, I see no reason to not include this in the next 3.x version.
Yes it is B/C. It does not change existing behavior for PHP 5.3 users. That part is not in dispute. What is in dispute though is this introduces a behavior where PHP 5.3 users get logic based on one workflow and PHP 5.4+ users potentially get logic based on another workflow (IIRC the transliterator stuff is always enabled, but iconv is optional). This can result in this function producing different results in different hosting environments, or even pre- and post- upgrade for users moving off of PHP 5.3. That is what makes me uncomfortable accepting this into 3.x. Our API should be consistent regardless of the user being hosted on a PHP 5.3 platform or PHP 7.1, and generally we've done a good job at keeping things consistent.
To be clear, I'm fine with this improvement in general. I'm fine with it landing in 4.0. I'm just not 100% comfortable introducing this into 3.x.
If a = 1 do X
If a = 2 do Y
That is a fundamental different behaviour to me
if a = 1 displays Error
if a = 2 does not display Error
Is indeed different. ROTFLOL
Whatever...
What is about that symfony lib? Don't include this lib the fallbacks?
There is not a polyfill for the transliterator_transliterate
function. There are polyfills available for the iconv
, intl
, and mbstring
PHP extensions.
intl does contains transliterator_transliterate afaik
Even though extensions get polyfilled, not all of their functionality gets replicated in all cases. In Symfony's package, transliterator_transliterate
is not one of the replicated things. And on a quick Google search, I did not see a polyfill for that function.
Therefore, if I understand well, even for 4.0, if the php intl extension is not enabled, we will be exactly in the same situation as for 3.x.
This can result in this function producing different results in different hosting environments
I also think we shouldn't include it in 3.x, it is a difference if we have a different system behaviour on enable/disables php extensions or different php versions. The first one is something you can fix be enabling the extension the second one can't be fixed. Even if this will help some users and improve the system for them, we shouldn't start following the path making the system behave different for different PHP Version on purpose.
Milestone |
Removed: |
Status | Ready to Commit | ⇒ | Needs Review |
Labels |
Added:
?
|
Status | Needs Review | ⇒ | Discussion |
@rdeutz @infograf768 @ggppdk @mbabker and all of you
I suppose it's time to re-evaluate now and set a 4.0 milestone label.
It should really be used in the new media manager.
As it is at the moment, it even got worse as any dashes (-) get removed.
Status | Discussion | ⇒ | Needs Review |
@HLeithner This really should be closed. It's never going to get merged in J3
Similar pr then the other, a rebase to j4 would make sense if it's not fixed already...
Status | Needs Review | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2019-12-29 22:52:16 |
Closed_By | ⇒ | HLeithner |
Are you aware that filenames coming from MS-Windows (Unicode) are translated to utf-8 since PHP version 7.2. i.e. makeFilesafe can be modified now that J4 Beta has the requirement min 7.2.5
It was not possible to change it for the New MediaManager until this decision was made.
I use this behaviour myself since the PHP 7.2 release. I e.g. store filenames with UTF-8 characterset in the database without any noticed problems
I have no clue if the New MediaManager will be backported to J3.10
@schnuti
Not sure what you mean. Afaik one can’t use pure utf8 characters in a path (the image name or anything passing through makeSafe() will be part of a path).
The only way is using percent encodings (same method we use for unicode slugs) which may create problems for file names.
I am discussing now with @HLeithner about including or not the patch above (and the other one specific for aliases) in J4.
I don't know any cases where percent encoding is needed. As far as I can see, the browser or the javascript translates the characters as needed.
Exampel:
pasted in the browsers address field or used browsers Open File:
file:///D:/images/_Üb er_größen-längeå.jpg => shows the image
This copied from the browsers address field, pasted to the windows simple editor:
file:///D:/images/_%C3%9Cb%20er_gr%C3%B6%C3%9Fen-l%C3%A4nge%C3%A5.jpg
javascript is internally using %-encoding if needed as I've seen when debugging.
The same is valid for using alias with those characters. No router problems as far as I can see.
I have no clue if the New MediaManager will be backported to J3.10
This part is easy, no it will not be backported, 3.10 is a compatibility release for J4 with no new planed features.
This patch is for saving files and to have a filename that does a bit of a meaning for non latin languages. I'm not sure if restricting filenames to latin characters only is necessary but I also see no reason to remove this limitation atm.
As poly-fill to intl extension we could use https://github.com/LukeMadhanga/transliterator for example. But only for J4+
What's @wilsonge opinion?
As poly-fill to intl extension we could use https://github.com/LukeMadhanga/transliterator for example. But only for J4+
Looks like it is rather limited.
what can it not do that the intl extension can do?
Look by yourself which character sets it deals with in its description...
what can it not do that the intl extension can do?
I would say it misses any asia language.
and arabic, and hebrew, persian, etc.
it is a simple transliterator, just a bit more complex than the default core one.
the php intl extension uses ICU as clearly explained in this PR description.
Sorry for the delay replying here - There's no intention to backport new media manager into 3.10 - if nothing else browser requirements will kill us. I'm more than happy to see this go into J4 media manager though - will leave 3.x decision to @HLeithner
We have it in j4 now, I don't want to change this in J3. thx
@HLeithner
Nope, we do not have it in J4.
It was only done for the url alias.
I have prepared a patch for J4 and asked @wilsonge if Ok to propose.
What will happen if I move a site from a server that does use this php function to a server that does not?