? Failure

User tests: Successful: Unsuccessful:

avatar ggppdk
ggppdk
16 Sep 2016

Pull Request for Issue #7841

Summary of Changes

  1. Multiple transliterations are tried
  2. [RFC] In order to avoid ugly looking empty filenames, current time is used if alls transliteration fail to do a complete job
  3. Added new method makeUnique($base_Dir, $file) for finding a unique file name inside a given folder

Testing Instructions

Upload files that are using characters in SITE or ADMIN default language of your website

  • Filenames should be transliterated

Please also do a code review if you can

Documentation Changes Required

A new optional parameter $language = null is added to the method ... for doing a specific language transliteration

avatar ggppdk ggppdk - open - 16 Sep 2016
avatar ggppdk ggppdk - change - 16 Sep 2016
Status New Pending
avatar joomla-cms-bot joomla-cms-bot - change - 16 Sep 2016
Labels Added: ?
avatar joomla-cms-bot joomla-cms-bot - change - 16 Sep 2016
Category Libraries
avatar ggppdk ggppdk - change - 16 Sep 2016
The description was changed
avatar ggppdk ggppdk - edited - 16 Sep 2016
avatar ggppdk
ggppdk - comment - 16 Sep 2016

@ralain

The time stamp is going to cause issues when uploading multiple files in the same request, or even by multiple users to the same location at the exact same time. Chances are each file after the first will overwrite the previous.

Yes i know, and in my own custom code i have extra code to make sure that a unique name is created by examining the path too, but we should not do it inside makeSafe,

  • it is "makeSafe" ... not "makeUnique", it had no such functionality before this PR,
  • and it does not have parameter to take folder path to make such a check

best option is to add seperate method for it,

  • we can make it more very rare by adding microsends (passing to date the seconds and then appending to the final string the microseconds) , still for the cases that transliteration succeeds then don't we need to make a check too ?, that is why i say it must be in a seperate method of the CLASS, or at custom code of the component

@wilsonge

$lang_params = JComponentHelper::getParams('com_languages');
I'm nervous about introducing component deps into libraries (this is an ongoing debate). Is this really giving us anything? If the article itself has no language then should be really set a language at all even if there is a default site one...

about dependency i agree, any better ideas ? it would be nice to avoid it,
about no language, well it makes sense to try frontend language default

[EDIT]
@wilsonge
we could remove the dependency to the configuration of the language component and make it the responsibilty of the caller, ... to decide and pass an array of languages to try for transliteration
should i do this change ?? and remove the dependency ?

avatar ggppdk ggppdk - change - 16 Sep 2016
The description was changed
avatar ggppdk ggppdk - edited - 16 Sep 2016
avatar ggppdk
ggppdk - comment - 16 Sep 2016

i will remove dependency on language component configuration, and fix code styling, please do not test yet

avatar franz-wohlkoenig
franz-wohlkoenig - comment - 11 Jan 2017

@ggppdk still not testing?

avatar webmiep webmiep - test_item - 31 Mar 2017 - Tested unsuccessfully
avatar webmiep
webmiep - comment - 31 Mar 2017

I have tested this item ? unsuccessfully on 30abfa6

By appliing the PR the site give this error
Parse error: syntax error, unexpected 'if' (T_IF) in /home/xxx/public_html/nl/xxxxxx/libraries/joomla/filesystem/file.php on line 126

Sorry I was testing on a joomla 3.7rc1 site


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12049.

avatar webmiep
webmiep - comment - 31 Mar 2017

By appliing the PR the site give this error
Parse error: syntax error, unexpected 'if' (T_IF) in /home/xxx/public_html/nl/xxxxxx/libraries/joomla/filesystem/file.php on line 126

Sorry I was testing on a joomla 3.7rc1 site


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12049.

avatar ggppdk
ggppdk - comment - 1 Apr 2017

hi, having more family members now, takes up a good portion of my time, now i can only work on to updating my extensions for J3.7 and on my other obligations, i can not update / look into this PR any more

you can close or rewrite this PR, it should work with only a few changes

avatar franz-wohlkoenig franz-wohlkoenig - change - 1 Apr 2017
The description was changed
Status Pending Needs Review
avatar joomla-cms-bot joomla-cms-bot - edited - 1 Apr 2017
avatar ggppdk
ggppdk - comment - 26 May 2017

@mbabker
i now have time to update / support this PR,
you can remove the label "needs new owner"
thanks

avatar brianteeman
brianteeman - comment - 26 May 2017

Label removed

avatar ggppdk
ggppdk - comment - 9 Jun 2017

@wilsonge and anyone else

I'm nervous about introducing component deps into libraries (this is an ongoing debate).

so , any idea how i can replace usage of :

$lang_params = JComponentHelper::getParams('com_languages');

since as said above it should be avoided

avatar schnuti
schnuti - comment - 20 Jun 2017

@ggppdk
Can't you send an array with the languages you now look up in the J library file?
e.g. a "helper" method in the CMS library could do the job and send the filename + languages to the J library.

I have to ask. Are those languages needed to "change" the translation-"table". I once did that "manually" in Joomla 1.5 for a cyrillic language..

avatar ggppdk
ggppdk - comment - 21 Jun 2017

@wilsonge

you suggested the removal of dependency JComponentHelper so that JComponentHelper::getParams('com_languages') is avoided

Can you can suggest an alternative for it ? can anyone else suggest an alternative ?

avatar schnuti
schnuti - comment - 21 Jun 2017

@ggppdk
Didn't you read my suggestion and question? I saw you had the same idea somewhere above. Let the caller send the languages.

I found confusing use of makeSafe.
Correct in old com_media file controller

$file['name']     = JFile::makeSafe($file['name']);

But what happens here? In cms/helper/media method canUpload($file...

if (str_replace(' ', '', $file['name']) != $file['name'] || $file['name'] !== JFile::makeSafe($file['name']))

This tells me (the user) to make the filename safe with Joomla logic before uploading. (Currently in new mediamanager 403 forbidden.) i.e. even if the makeSafe code is modified this will kill the upload.

Related joomla-projects/media-manager-improvement#260

avatar ggppdk
ggppdk - comment - 21 Jun 2017

@schnuti

Didn't you read my suggestion and question? I saw you had the same idea somewhere above. Let the caller send the languages.

This PR already includes this new parameter for makeSafe

but

  • it will involve code replication at all calling places
  • we will need to search and change all of them
  • also 3rd party extension will need to do same change when they call makeSafe to get the benefit

so i was hoping to avoid it

avatar schnuti
schnuti - comment - 21 Jun 2017

That was the second step i had.

e.g. a "helper" method in the CMS library could do the job and send the filename + languages to the J library.

Now is the question if this is possible in a BC way. I'm not fit enough with namespaces to know if a call to makeSafe can execute a CMS method and this method can call a makeSafe Joomla method. I hope so!

avatar schnuti
schnuti - comment - 21 Jun 2017

I*ve implemented the code in J 4.0 and tested to upload in the new Media Manager.
I moved the language "loading" out of the library file to the controller taking care of the AJAX call.

// Language transliteration should include given language, and also site + admin defaults (most useful is site default)
$lang_params = ComponentHelper::getParams('com_languages');
$lang_site_default  = $lang_params->get('site', '*');
$lang_admin_default = $lang_params->get('admin', '*');

//$languages[$language]   = $language && $language != '*';
$languages[$lang_site_default]  = $lang_site_default != '*';
$languages[$lang_admin_default] = $lang_admin_default != '*';

// Call to transliteration method
$name = \JFile::makeSafe($name, $languages); 

There are some issues not related directly to this as I mentioned above.

To my surprise a method is called that modifies the file name with Punycode! Why?
After removing those obstacles I tested with german and swedish characters with result as expected. Without installing the languages!
I had problems with french characters so I installed the language pack and set it to default language. I still have a problem here so I get Datetime as filename. I found that normal french characters are translated. It's the character in d'Artagnan that is NOT removed by the transliteration method. The character gets removed by the regex but then the check translated? fails.

I got some problems with the regex not working. As I do not understand what it is supposed to do I used my smaller version that so far works.
$regex = '/(\s|[^A-Za-z0-9-.])+/';
How to fix that d'Artagnan gets dArtaggnan or d-Artagnan and not datetime?

I tested the language specific characters + blanks. I had to make some changes to handle the blanks - to replace them with hyphens.

It has to be tested in J3 to find out if there are three different handlings of the filename in ONE TRANSACTION as in J 4.0. Then it should be cleaned up.

avatar schnuti
schnuti - comment - 22 Jun 2017

I've made some more testing (Joomla 4.0 MM upload file)
Western and to my surprise Eastern European Latin characters are translated by default using the en-GB language . Probably without a language parameter.
e.g.
Über größen länge .jpg => ueber-groessen-laenge.jpg (German)
d'Édimbourg a été admis à l'hô .jpg => d-edimbourg-a-ete-admis-a-l-ho.jpg (French)
Åländsk höna.jpg => alaendsk-hoena.jpg (Swedish)
výročí na příští.jpg => vyroci-na-pristi.jpg (Czech)
But sorry they are all using Latin Characters
Πωλούν ακίνητα .jpg => 2017-06-22_23-00-42_000000.jpg (Greek)

I'm not able to test with Greek installed as default language.

I'll try to exclude (replace with hyphens) some more characters with regex. I'll follow the recommendations from Michigan Technological University. Link in my "Normalize filenames" issue in the "New Media Manager" project.

You can see the edited method code in my gist -
file-makeSafe.php

I'll edit the regex when I've tested the changes I have in mind.

Shouldn't the length be checked and limited? 190 characters or less? I believe 190 is the limit if you later want to index the filenames in a mySql table. MTU's 30 is a little too minimized;)

avatar schnuti
schnuti - comment - 22 Jun 2017

We can close the idea of using the filenames from the OS (Unicode/UTF or whatever).
In version 1 of my Gist I had methods to save the filenames as is. They were ok in the PHP code but Joomla's file write method wrote the files to disk with unreadable characters if not in the range a-zA-Z.

If there are some Joomla hackers that want to invest time in such an issue he/she would be very welcome.

avatar ggppdk
ggppdk - comment - 22 Jun 2017

In version 1 of my Gist I had methods to save the filenames as is. They were ok in the PHP code but Joomla's file write method wrote the files to disk with unreadable characters if not in the range a-zA-Z.

That must have been in windows ? right ?
See my comment here
#16595 (comment)

About this PR, i do not know how to proceed,
can it be accepted to use
$lang_params = JComponentHelper::getParams('com_languages');

inside makeSafe() or is it strictly a "Not do it" ?

avatar schnuti
schnuti - comment - 23 Jun 2017

Yes local Wiindows10, XAMPP, PHP7
If it doesn't work in Windows it's a no go for me You can't allow OS dependent functionality in a CMS for known reasons.

I moved the language look up out of the method for the test. My hope is that the call to makeSafe can be directed to a CMS method and then via the use of namespace be redirected to the Joomla library. I see no problem to move the lines back if this is not possible.

You should though have a look at the gist code. I've made some editing to solve the issues I found. e.g. French texts did not work. Not allowed characters have to be removed before the transliteration loop, where those characters are checked and gives a *false" result. And this result is false!

avatar schnuti
schnuti - comment - 23 Jun 2017

I had a short look at proposed LINUX solution. It is NOT a solution. The changes are only in the library file and folder classes. i.e. Any checks before calling those methods in the library e.g. in makeSafe also have to react on the OS version! A complete mess!

avatar schnuti
schnuti - comment - 23 Jun 2017

I've updated my gist
Please check the regex. As I wrote - the filename transferred might not come from an OS. With AJAX and base64 encoded data anything is possible!
I added a length restriction. Has nothing to do with the file size restriction.

I think I have everything in my mind covered by now.

EDIT: And tested!

avatar schnuti
schnuti - comment - 23 Jun 2017

Maybe I should open a new issue to ask for some kind of input. As a user not related to the English world the supported filename conversion is a disaster! Since Joomla 1.5! As I understand you have your workaround and I have another one.

avatar ggppdk
ggppdk - comment - 23 Jun 2017

You should though have a look at the gist code. I've made some editing to solve the issues I found. e.g. French texts did not work. Not allowed characters have to be removed before the transliteration loop, where those characters are checked and gives a *false" result. And this result is false!

@schnuti

Well i have made it return false (meaning result of transliteration is not acceptable)
on purpose and forced the date('Y-m-d_H-i-s_u') as safe filename

Reason is this:

you suggest that:
$file = '\ / : * ? <> myfile| " # % & { } $ ! @ _.jpg';
should give the safe filename: myfile.jpg

but with your suggest if e.g. you have
$file = '\ / : * ? <> my| " # % & { } $ ! @ _.jpg';
then you will get safe filename my.jpg

and if you have
$file = '\ / : * ? <> | " # % & { } $ ! @ _.jpg';
then you will get safe filename .jpg

Thus the result may have very few or zero corresponding characters from the original filename
that is why i choose to force it to date even if 1 character is not transliterated

Again probably my choise is not optimal either,
e.g. it would be better if 90% of the LENGTH of original filename is preserved after
-- both (1) transliteration and (2) removal of remaining bad characters,

then accept it as safe filename ...

avatar schnuti
schnuti - comment - 23 Jun 2017

Well, as in the case of the French filename one (1), the single quote, character was the problem. m.jpg is a valid filename. The case without any character .jpg is catched at the bottom, before acceptance and gives a datetime filename . How will you count 90 % if -my.jpg or --m.jpg is used? m.jpg or my.jpg should be accepted as valid filenames. With your duplicate check it's still possible to store a lot of m.jpg ;)
I guess in the real world most cases will use very few forbidden characters but all of them might appear.

The check you do in the the loop should only give false if any "real" characters still remains. e.g if I (on a Latin system) use Greek or Bulgarian characters in my filenames.

My example was constructed around the invalid characters to test that my regex works. Not from the real world. Most of the characters are not allowed on the filesystem or are very rarely used in filenames. Personally I use the DB objects title or alias to generate the uploaded main image's name. This can be long and include some invalid characters. Those files are though still reusable.

avatar schnuti
schnuti - comment - 23 Jun 2017

Forgot
If the user sends
$file = 'My\ / : * ? <> very| " # big% & file{ } $ ! @ _.jpg';
He should get
my-very-big-file.jpg
OK to me! Not depending on how many invalid characters was removed.

avatar brianteeman
brianteeman - comment - 23 Jun 2017

And don't forget that Apple have just introduced a new file system with its own unique issues with character sets

avatar infograf768
infograf768 - comment - 23 Jun 2017

BTW, here is a pure php solution (since php 5.4) for transliteration
echo transliterator_transliterate('Any-Latin; Latin-ASCII;', 'Ολοκλήρωση.png');
will give Oloklerose.png
in Persian
transliterator_transliterate('Any-Latin; Latin-ASCII;', 'چکیده.png');
will give
chkydh.png
in Chinese
transliterator_transliterate('Any-Latin; Latin-ASCII;', '完成.png');
wan cheng.png

avatar infograf768
infograf768 - comment - 23 Jun 2017

we can also use
transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', 'filename.jpg');
to only get lowercase file names

and to be sure some characters are taken care of
iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", transliterator_transliterate('Any-Latin; Latin-ASCII'; Lower()', etc.

avatar brianteeman
brianteeman - comment - 23 Jun 2017

As it is php 5.4+ then it cannot be used in j3 it will only be possible in j4

(this proposal came up before=

avatar schnuti
schnuti - comment - 23 Jun 2017

@infograf768
How do you upload files in French language to Joomla?
I get
ddimbourg-a-t-admis--lh-.jpg
for
d'Édimbourg a été admis à l'hô .jpg
Sorry for the filename I just picked some French words with some French specific characters.
To me that is a bug since J 1.5

Thanks for the link. With this transliteration we try to bypass the encoding problems.
If there are other better methods that can be introduced/backported to J3 I'm sure everyone is open to it.
As I understand there is a project that is targeting the cloud communication.
And I hope you still want to support Windows systems.

avatar infograf768
infograf768 - comment - 23 Jun 2017

I was not considering this transliteration function for J3

avatar infograf768
infograf768 - comment - 23 Jun 2017

d'Édimbourg a été admis à l'hô .jpg

I would never call a file like this. ?

avatar schnuti
schnuti - comment - 23 Jun 2017

I would never call a file like this.

Then we are two ;)

avatar infograf768
infograf768 - comment - 23 Jun 2017

@brianteeman
Using it in j4 would be great as a default transliterate to replace the simple limited one I added in 1.6 (https://github.com/joomla/joomla-cms/blob/staging/libraries/joomla/language/transliterate.php) . TTs would still be able to override it if they wish, or just take of their custom method in their localise.php.

avatar schnuti
schnuti - comment - 23 Jun 2017

Are you telling us
"Leave it as is and wait for some future solution"?

avatar infograf768
infograf768 - comment - 23 Jun 2017

Yep

avatar brianteeman
brianteeman - comment - 23 Jun 2017

That's your opinion. Yes it will be easier in j4 where we dont have the php 5.3 limitation but j3 is going to be supported for at least 2.5 more years it doesnt mean that we should give up trying to find a solution to this issue for j3

avatar schnuti
schnuti - comment - 23 Jun 2017

@ggppdk
I'll nw move my tests to J3.
I'll have to know what we agree on and what not. My approach will be the same as the gist.
I'll reimplement the language look up into the method

avatar ggppdk
ggppdk - comment - 23 Jun 2017

@schnuti

if you have time to make a new PR ?
then i will happily close this one

avatar schnuti
schnuti - comment - 24 Jun 2017

@ggppdk
I only wanted to test and help you get a solid solution. And that was the intention of my comment above. I can't really test with other languages than Latin ones. The interest in fixing things (old bugs) in J3 seems to be a little low. See some of the "funny" comments above where I didn't at all understand the first hmm with a useless link.
I'll still want to go on as I want to use it for my own transliterations anyway. I though understand if you want to close the issue. I hope not!
Until then I'll give feedback from my tests.

In my first tests with J3 I can confirm the results from the tests with J4. Everything ok with Latin characters. Are you sure about other languages character encodings?

I know that you had to use custom overrides for transliterations in J1.5 Are you sure that this is solved in J3?
One issue I found is that you can't use the datetime as is for none conform names. Any blank has to be replaced by hyphens or it will fail in the media helper media.php.

if (str_replace(' ', '', $file['name']) != $file['name'] || $file['name'] !== JFile::makeSafe($file['name']))
{

Heads up!

.

avatar schnuti
schnuti - comment - 24 Jun 2017

@ggppdk
The message given to the user should be changed
JLIB_MEDIA_ERROR_WARNFILENAME
IS:
File name must only contain alphanumeric characters and no spaces.
SHOULD depending on the solution be something like:
File name must only contain alphanumeric characters a-z, A-Z or 0-9 and no spaces.

avatar schnuti
schnuti - comment - 24 Jun 2017

@brianteeman
What is your feeling about a solution that only helps Latin languages. It solves the issue for a lot of users around the world but is not optimal.
I hope though that this proposal can fix more.

avatar schnuti
schnuti - comment - 24 Jun 2017

@ggppdk
Jippee ?
I have no real use of it but I installed the Greek language, set it to default language and uploaded a file with Greek characters. A file that completely failed without the Greek language installed.
A proof that it works as you say. I'm impressed.
Great!

avatar schnuti
schnuti - comment - 24 Jun 2017

@ggppdk
I propse a new Message if the name transliteration fails and a datetime is used.

"Your file with the not safe name Ακρόπολη Αθηνών.jpg was saved as 2017-06-24-09-51-01-000037"

or something like that.

avatar brianteeman
brianteeman - comment - 24 Jun 2017

@brianteeman
What is your feeling about a solution that only helps Latin languages.

Personally I would not be in favour of anything that makes joomla different depending on language

avatar infograf768
infograf768 - comment - 24 Jun 2017

The other solution in J3 is to puny encode file names when they are not latin.
They could be saved puny encoded and displayed decoded (or not) in the Media Manager.
Example:
點看.jpg would be saved as xn--c1yn36f.jpg
Ακρόπολη Αθηνών.jpg would give xn-- -ylbbwcfkixaolm8mpb.jpg

This would evidently not solve the issue of adding manually a file with its utf8 name in the images folder.

avatar ggppdk
ggppdk - comment - 24 Jun 2017

Quoting

we can also use
transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', 'filename.jpg');
to only get lowercase file names

and to be sure some characters are taken care of
iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", transliterator_transliterate('Any-Latin; Latin-ASCII'; Lower()', etc.

So is it bad if we use for PHP 5.4+ ?
transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', 'filename.jpg');

and if older version then fallback to current behaviour ?

avatar infograf768
infograf768 - comment - 24 Jun 2017

Just tested on MAMP with php 5.4.4, and I get
Call to undefined function transliterator_transliterate()
In php 7.0.8 it works fine. Looks like there is an issue with php 5.4.x

[EDIT] Looks like PHP needs the intl extension to be present to use transliterator.

avatar infograf768
infograf768 - comment - 24 Jun 2017

@ggppdk @schnuti
Here is a .diff of a proof of concept depending on the presence of iconv and transliterator_transliterate php functions.

I modified the MediaControllerFile as JFile::makeSafe is used in other parts of core and I think it is safer to not touch at the method.

filename.diff.zip

avatar infograf768
infograf768 - comment - 27 Jun 2017

Please test patch #16878

avatar infograf768
infograf768 - comment - 27 Jun 2017

And as well #16880

avatar ggppdk
ggppdk - comment - 27 Jun 2017

@infograf768

nice, ?
i will close this PR in favour of them

avatar ggppdk ggppdk - change - 27 Jun 2017
Status Needs Review Closed
Closed_Date 0000-00-00 00:00:00 2017-06-27 10:08:50
Closed_By ggppdk
avatar ggppdk ggppdk - close - 27 Jun 2017

Add a Comment

Login with GitHub to post a comment