This PR adds a way to add/remove common words in Smart Search during installation of language packs.
Smart Search has a table for common words per language, which Smart Search filters out in the suggestions. (In a coming PR, these common words can also be filtered out of the index itself in order to keep the index smaller.) Up till now, there was no way for language packs to add their common words to that table and in general no way to update those words. (Although that should rarely be necessary. These lists are pretty static.)
I've added an additional column "custom" to the #__finder_terms_common table to mark words as coming from a language pack (=0) or having been inserted by a user (=1). There is currently no interface for a user to add their own words. Depending on the discussion, this could be added as an option in the component configuration.
I also added a finder plugin in the extension group. Upon installing an extension, this plugin checks whether it is a language and whether that language comes with a *.com_finder.commonwords.txt file. It then reads that file and adds its content to the #__finder_terms_common table. When a language is updated, the entries for that language that came from a language pack are deleted and re-added. Upon uninstalling, these entries are also deleted. The file can be put in both the frontend and the backend language folder. If it is in both, the plugin will simply run twice, once for the frontend and once for the backend. Since these steps are executed very rarely, this waste of resources should still be okay.
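The install, update, and uninstall behaviour described above can be sketched as follows. This is only an illustration in Python with an in-memory SQLite table, not the plugin's actual code: the simplified schema, the unprefixed table name, the helper names, and the choice to skip a pack word that a user has already added are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for #__finder_terms_common (assumed schema).
conn.execute(
    "CREATE TABLE finder_terms_common ("
    " term TEXT NOT NULL,"
    " language TEXT NOT NULL,"
    " custom INTEGER NOT NULL DEFAULT 0,"
    " UNIQUE (term, language))"
)


def remove_pack_words(conn, language):
    """Uninstall step: delete only the words that came from a language pack."""
    conn.execute(
        "DELETE FROM finder_terms_common WHERE language = ? AND custom = 0",
        (language,),
    )


def install_pack_words(conn, language, words):
    """Install/update step: drop the old pack words, then add the new list."""
    remove_pack_words(conn, language)
    # OR IGNORE is an assumption: a pack word already added by a user is skipped.
    conn.executemany(
        "INSERT OR IGNORE INTO finder_terms_common (term, language, custom)"
        " VALUES (?, ?, 0)",
        [(word, language) for word in words],
    )


def terms(conn, language):
    """Return (term, custom) pairs for one language, sorted by term."""
    rows = conn.execute(
        "SELECT term, custom FROM finder_terms_common"
        " WHERE language = ? ORDER BY term",
        (language,),
    )
    return rows.fetchall()


# A word inserted by a user (custom = 1) survives every step below.
conn.execute("INSERT INTO finder_terms_common VALUES ('joomla', 'en', 1)")

install_pack_words(conn, "en", ["the", "and", "of"])  # first install
install_pack_words(conn, "en", ["the", "a"])          # update with a new list
after_update = terms(conn, "en")

remove_pack_words(conn, "en")                         # uninstall
after_uninstall = terms(conn, "en")

print(after_update)
print(after_uninstall)
```

In this sketch, a later install for the same language code replaces the earlier pack-provided words entirely, while rows marked custom = 1 are left alone.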
The *.com_finder.commonwords.txt file is a simple text file with one word per line. Everything following a semicolon (;) is considered a comment, and comments are ignored. Whitespace and empty lines are also ignored. The file has to be saved as UTF-8.
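As a rough illustration of the file format described above, here is a minimal parser sketch in Python. It is not the plugin's implementation; the function names and the sample content are hypothetical.

```python
from pathlib import Path


def parse_common_words(text):
    """Return the words from the body of a commonwords file, one per line."""
    words = []
    for line in text.splitlines():
        # Everything from the first semicolon onward is a comment.
        word = line.split(";", 1)[0].strip()
        if word:  # skip empty and comment-only lines
            words.append(word)
    return words


def read_common_words(path):
    """Read a commonwords file from disk; the file is expected to be UTF-8."""
    return parse_common_words(Path(path).read_text(encoding="utf-8"))


# Illustrative sample only, not the real English list.
sample = """\
; English stop words
the
  and   ; surrounding whitespace is ignored

of ; trailing comment
"""

parsed = parse_common_words(sample)
print(parsed)
```

For the sample above, the parser yields the three words the, and, of; the comment line and the empty line contribute nothing.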
I extended the list of common words for English to the list proposed here: http://snowball.tartarus.org/ This site would also be the source for a bunch of other languages. To test and give more examples, I also modified the attached translation package for German with the list from that site.
de-DE_joomla_lang_full_3.8.8v1.zip
Status | New | ⇒ | Pending |
Category | ⇒ | SQL Administration com_admin Postgresql Language & Strings Installation Front End Plugins |
Awesome, like it!!!
@Hackwar Need feedback please
Questions
Is this totally independent from the stemmers? I mean, is it possibly useful for languages other than the ones that have stemmers?
What happens if the language pack contains the en-GB list of common words instead of a specific list for the language? Or if the file exists but is empty because the TT does not know what to enter in the file? I am asking this as the file will be proposed for translation on Crowdin.
I did not find the list for French on tartarus, for example. Can you point me to the specific page?
EDIT: Is it what they call stopwords? http://snowball.tartarus.org/algorithms/french/stop.txt
This has nothing to do with stemmers and is totally independent. It is a list of words that we do not want to index because they add no value to the search. In the phrase "la voiture", the "la" will not really be relevant in any search. We can simply filter them out without losing any search precision.
If the language pack contains the English stopwords, then those words will be added to the table with the respective language's code. That doesn't really hurt, but it doesn't help either. If the file is empty, nothing is added.
The list is called the stopwords list indeed.
Another question:
Let's say en-US (or fr-CA, for example) contains a different list of common words than en-GB (or fr-FR).
What will happen? From reading your original comments, it looks like everything related to the "en" or "fr" tag is deleted and then re-added when a language is installed or updated. It looks to me like the last en-XX list updated or installed will therefore override any other list already present.
EDIT: this means that if en-US contains only 4 words, then all the words entered in the db by en-GB will be deleted if en-US is updated or installed after en-GB.
The *.com_finder.commonwords.txt file is a simple text file with a word per line. Everything following a semi-colon (;) is considered a comment and comments are ignored. Whitespace and empty lines are also ignored. The file has to be saved as UTF-8.
Does each word have to be on its own line? If so, it will not work in Crowdin (basically the same thing as explained in the stemmer PR). In Crowdin, a txt file is translated by sentences, and in the absence of a sentence it's done per line. You can't add or remove lines (content) there, just translate what is there in the source.
And yes, this is already an issue in the existing localise.php file.
If the list can be done as a comma-separated (or whatever suits best, just not dots or line breaks) list of entries, then it should work in Crowdin, I think. It will probably warn the translator because of the different number of commas but should let you save it anyway.
Correct, and that is perfectly fine. Again, this is not an exact science, and it is not a vital feature. This list of words creates very slightly better search results and means slight performance improvements. Again, don't think of the list as words, think of them as binary data or whatever. Quite honestly, if we are going to have another discussion about this with 150 messages on how lists can be managed, etc., then I'd rather remove the whole table and all the related stuff. We can overthink stuff... Especially when the whole world is doing it one way and Joomla seems to want to re-invent the wheel again here.
@Bakual it is a static file. If Crowdin does not let us add static files to language packs, then we have to look for a different platform.
If Crowdin does not let us add static files to language packs, then we have to look for a different platform.
Don't open Pandora's box.
Using Crowdin as the translation tool is a decision made by the Production Department.
Honestly, I doubt other tools (e.g. Transifex) allow hosting "static" files. After all, they're translation tools, not code repos.
So no, moving to another translation platform is no option for the time being.
@dgrammatiko Dimitris, did you even read what this PR is about? The code from your PR doesn't help here at all since it's something completely different. Having those words in an INI file would actually be even worse than having them in a txt file.
And now?
I have tested this item
I installed the patch and added the sql files.
The extension_id 490 is no longer available for this extension. I changed the id to 491.
I installed the language pack and looked in the table finder_terms_common, but no German words were added.
I hope I did everything right.
I installed the language pack and looked in the table finder_terms_common, but no German words were added.
You can't get common words with our present language packs.
This PR needs a specific .txt file in each language pack, using a specific format, for this to work.
The problem with this .txt file format has been explained above: translation tools just can't take care of it because it is not a translation.
That is why this solution can't be used by our CMS.
I extended the list of common words for English to the list proposed here: http://snowball.tartarus.org/ This site would also be the source for a bunch of other languages. To test and give more examples, I also modified the attached translation package for German with the list from that site. de-DE_joomla_lang_full_3.8.8v1.zip
What is this @infograf768 ? A special translation package!
What is this @infograf768 ? A special translation package!
LOL. Apologies. Did not see that.
I checked this and there was an issue with the SQL table. The SQL table was created with collation utf8mb4_general_ci, which meant that non-Latin characters were transliterated for indexes. That resulted in the words das and daß having the same index value, violating the unique constraint. Anyway, this should work now. Please test this PR.
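The collision described above can be imitated in Python with SQLite and a custom collation. This is only a rough stand-in: the comparison function below merely lowercases and folds ß to s, which is how utf8mb4_general_ci compares that one character, and it does not reproduce MySQL's collation in general. The table and function names are hypothetical.

```python
import sqlite3


def general_ci_like(a, b):
    """Crude stand-in for a case- and accent-folding collation (assumption)."""
    def key(s):
        return s.lower().replace("ß", "s")

    ka, kb = key(a), key(b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("general_ci_like", general_ci_like)

# One table whose unique index folds characters, one that compares bytes.
conn.execute("CREATE TABLE folded (term TEXT COLLATE general_ci_like UNIQUE)")
conn.execute("CREATE TABLE exact (term TEXT COLLATE BINARY UNIQUE)")


def try_insert(table, words):
    """Insert words one by one and return those the unique index accepted."""
    accepted = []
    for word in words:
        try:
            conn.execute(f"INSERT INTO {table} (term) VALUES (?)", (word,))
            accepted.append(word)
        except sqlite3.IntegrityError:
            # The word collides with an existing entry under this collation.
            pass
    return accepted


folded_result = try_insert("folded", ["das", "daß"])
exact_result = try_insert("exact", ["das", "daß"])

print(folded_result)
print(exact_result)
```

With the folding collation the second word is rejected as a duplicate, while the byte-wise table keeps both entries.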
I will test this PR on Sunday.
We still can't add such a file through Crowdin...
Then find a different way. Sorry, but it sounds like Crowdin can only translate rows one to one, and then there is no way that we can handle such a list of common words through Crowdin. Just the fact that there is the English word "the", which in German could be "der, die, das", prevents that. Seems like Crowdin is a pretty crappy tool...
I cannot apply this PR. I think this is because this branch is out-of-date with the base branch.
I know this. I wanted to install the patch like this:
curl -o 20781.diff https://patch-diff.githubusercontent.com/raw/joomla/joomla-cms/pull/20781.diff
git apply 20781.diff
I do not know a simpler way.
Closing this right now as it has been merged into the code base as part of another PR by accident. Discussions are ongoing as to how to resolve the issues presented here.
Status | Pending | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2019-03-14 11:21:39 |
Closed_By | ⇒ | wilsonge |
Very cool idea!!!