Hackwar
16 Jun 2018

This PR adds a way to add/remove common words in Smart Search during installation of language packs.

Smart Search has a table for common words per language, which Smart Search filters out in the suggestions. (In a coming PR, these common words can also be filtered out of the index itself in order to keep the index smaller.) Up till now, there was no way for language packs to add their common words to that table and in general no way to update those words. (Although that should rarely be necessary. These lists are pretty static.)

I've added an additional column "custom" to the #__finder_terms_common table to mark words as coming from a language pack (=0) or having been inserted by a user (=1). There currently is no interface for a user to add their own words. Depending on the discussion, this could be added as an option in the component configuration.

I also added a finder plugin in the extension group. Upon installing an extension, this plugin checks whether the extension is a language and whether that language comes with a *.com_finder.commonwords.txt file. It then reads that file and adds its content to the #__finder_terms_common table. When a language is updated, the entries for that language that came from a language pack are deleted and re-added. Upon uninstalling, these entries are also deleted. The file can be put in both the frontend and the backend language folder. If it is in both, the plugin will simply run twice, once for the frontend and once for the backend. Since these steps are executed very rarely, this waste of resources should still be okay.
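
The install, update and uninstall behaviour described above can be sketched roughly as follows. This is only an illustration under assumptions, not the plugin's code: sqlite3 stands in for Joomla's database layer, the table is written without the #__ prefix, the column names term and language are assumed (only the custom column is named in this PR), and all function names are hypothetical.

```python
import sqlite3


def create_table(db):
    # Assumed schema: only the "custom" column is named in the PR text.
    db.execute(
        "CREATE TABLE finder_terms_common ("
        " term TEXT NOT NULL,"
        " language TEXT NOT NULL,"
        " custom INTEGER NOT NULL DEFAULT 0,"
        " UNIQUE (term, language))"
    )


def remove_pack_words(db, language):
    # Uninstall: drop only rows that came from a language pack (custom = 0).
    db.execute(
        "DELETE FROM finder_terms_common WHERE language = ? AND custom = 0",
        (language,),
    )


def install_pack_words(db, language, words):
    # Install and update share one path: delete the pack's rows, then re-add.
    remove_pack_words(db, language)
    db.executemany(
        "INSERT OR IGNORE INTO finder_terms_common (term, language, custom)"
        " VALUES (?, ?, 0)",
        [(word, language) for word in words],
    )


def terms(db, language):
    rows = db.execute(
        "SELECT term, custom FROM finder_terms_common"
        " WHERE language = ? ORDER BY term",
        (language,),
    )
    return rows.fetchall()


db = sqlite3.connect(":memory:")
create_table(db)
# A user-added word (custom = 1) that should survive pack updates.
db.execute("INSERT INTO finder_terms_common VALUES ('joomla', 'de', 1)")

install_pack_words(db, "de", ["der", "die", "das"])  # first install
install_pack_words(db, "de", ["der", "die"])         # update with a shorter list
after_update = terms(db, "de")

remove_pack_words(db, "de")                          # uninstall
after_uninstall = terms(db, "de")
```

In this sketch the user-added row is the only one left after the uninstall, which is the point of distinguishing custom = 0 from custom = 1.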

The *.com_finder.commonwords.txt file is a simple text file with a word per line. Everything following a semi-colon (;) is considered a comment and comments are ignored. Whitespace and empty lines are also ignored. The file has to be saved as UTF-8.
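
To make the file format concrete, here is a minimal parser sketch that follows the rules above (one word per line, everything after a semi-colon is a comment, whitespace and empty lines are ignored). The function name and the sample content are made up for illustration and are not taken from the plugin.

```python
def parse_common_words(text):
    """Return the words from the body of a commonwords file."""
    words = []
    for line in text.splitlines():
        # Everything following the first semi-colon is a comment.
        word = line.split(";", 1)[0].strip()
        if word:  # skip empty and comment-only lines
            words.append(word)
    return words


sample = """\
; sample stop words
the
  and   ; surrounding whitespace is dropped

of
"""

print(parse_common_words(sample))  # ['the', 'and', 'of']
```

When reading a real file, it would be opened explicitly as UTF-8, for example with open(path, encoding="utf-8"), since the file has to be saved in that encoding.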

I extended the list of common words for English to the list proposed here: http://snowball.tartarus.org/. This site would also be the source for a bunch of other languages. To test and give more examples, I also modified the attached translation package for German with the list from that site.
de-DE_joomla_lang_full_3.8.8v1.zip

Hackwar - open - 16 Jun 2018
Hackwar - change - 16 Jun 2018
Status: New -> Pending
joomla-cms-bot - change - 16 Jun 2018
Category: SQL, Administration, com_admin, Postgresql, Language & Strings, Installation, Front End, Plugins
Hackwar - change - 16 Jun 2018
Labels added
brianteeman - comment - 16 Jun 2018

Very cool idea!!!

carlitorweb - comment - 16 Jun 2018

Awesome, like it!!!

6b416a6 - 16 Jun 2018 - Hackwar - Typo
infograf768 - comment - 17 Jun 2018

@Hackwar Need feedback please

Questions
Is this totally independent from the stemmers? I mean, is it possibly useful for languages other than the ones that have stemmers?

What happens if the language pack contains the en-GB list of common words instead of a specific list for the language? Or if the file exists but is empty because the TT does not know what to enter in the file? I am asking this because the file will be proposed for translation on Crowdin.

I did not find the list for French on tartarus, for example. Can you point me to the specific page?
EDIT: Is it what they call stopwords? http://snowball.tartarus.org/algorithms/french/stop.txt

Hackwar - comment - 17 Jun 2018

This has nothing to do with stemmers and is totally independent. It is a list of words that we do not want to index because they give no additional value in the search. In the phrase "le voiture", the "le" will not be really relevant in any search. We can simply filter such words out without losing any search precision.

If the language pack contains the English stopwords, then those words will be added to the table with the respective language's code. That doesn't really hurt, but it doesn't help either. If the file is empty, nothing is added.

The list is called the stopwords list indeed.

infograf768 - comment - 17 Jun 2018

Another question:
Let's say en-US (or fr-CA, for example) contains a different list of common words than en-GB (or fr-FR).
What will happen? Reading your original comments, it looks like everything related to the "en" or "fr" tag is deleted when a language is installed or updated and then re-added. It looks to me like the last en-XX list updated or installed will therefore override any other already present.

EDIT: this means that if en-US contains only 4 words, then all the words entered in db by en-GB will be deleted if en-US is updated or installed after en-GB
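
The concern can be shown with a small sketch. It assumes, as this comment supposes, that the stored words are keyed by the short language tag ("en") rather than the full locale ("en-GB"). The dictionary and function names are hypothetical stand-ins for the table and the plugin.

```python
def short_tag(locale):
    # "en-GB" -> "en"
    return locale.split("-", 1)[0].lower()


# Stand-in for the common words table, keyed by short language tag.
common_words = {}


def install_language(locale, words):
    # Delete everything stored for the tag, then re-add the new list.
    common_words[short_tag(locale)] = set(words)


install_language("en-GB", ["the", "and", "of", "to", "a", "in"])
install_language("en-US", ["the", "and", "of", "to"])

print(sorted(common_words["en"]))  # ['and', 'of', 'the', 'to']
```

Under that assumption, whichever en-XX pack is installed or updated last determines the list for "en".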

Bakual - comment - 17 Jun 2018

> The *.com_finder.commonwords.txt file is a simple text file with a word per line. Everything following a semi-colon (;) is considered a comment and comments are ignored. Whitespace and empty lines are also ignored. The file has to be saved as UTF-8.

Does each word have to be on its own line? If so, it will not work in Crowdin (basically the same thing as explained in the stemmer PR). In Crowdin, a txt file is translated by sentences, and in the absence of a sentence it's done per line. You can't add or remove lines (content) there, just translate what is there in the source.
And yes, this is already an issue in the existing localise.php file.

If the list can be done as a comma-separated list of entries (or whatever separator suits best, just not dots or line breaks), then it should work in Crowdin, I think. It will probably warn the translator because of the different number of commas, but it should let you save it anyway.

Hackwar - comment - 17 Jun 2018

Correct, and that is perfectly fine. Again, this is not an exact science and it is not a vital feature. This list of words creates very slightly better search results and means slight performance improvements. Again, don't think of the list as words, think of it as binary data or whatever. Quite honestly, if we are going to have another discussion about this with 150 messages on how lists can be managed, etc., then I would rather remove the whole table and all the related stuff. We can overthink stuff... Especially when the whole world is doing it one way and Joomla seems to want to re-invent the wheel here again.

@Bakual it is a static file. If Crowdin does not support us to add static files to language packs, then we have to look for a different platform.

Bakual - comment - 17 Jun 2018

> If Crowdin does not support us to add static files to language packs, then we have to look for a different platform.

Don't open Pandora's Box.
Using Crowdin as the translation tool is a decision made by the Production Department.
Honestly, I doubt other tools (e.g. Transifex) allow hosting "static" files. After all, they're translation tools, not code repos. It's just not their business.
So no, moving to another translation platform is not an option for the time being.

dgrammatiko - comment - 17 Jun 2018

@Hackwar @Bakual you don't need the .txt file for Crowdin. Create another .ini file and use the code from #19772 to transform it to .txt, so Crowdin is happy and the project is also happy.

Bakual - comment - 17 Jun 2018

@dgrammatiko Dimitris, did you even read what this PR is about? The code from your PR doesn't help here at all since it's something completely different. Having those words in an INI file would actually be even worse than having them in a txt file.

Hackwar - comment - 5 Aug 2018

And now?

TobsBobs - test_item - 5 Aug 2018 - Tested unsuccessfully
TobsBobs - comment - 5 Aug 2018

I have tested this item unsuccessfully on 858669f

I installed the patch and added the sql files.
The extension_id 490 is no longer available for this extension. I changed the id to 491.
I installed the language pack and looked in the table finder_terms_common, but no German words were added.
I hope I did everything right.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20781.

infograf768 - comment - 6 Aug 2018

> I installed the language pack and looked in the table finder_terms_common, but no German words were added.

You can't get common words with our present language packs.
This PR needs a specific .txt file in each language pack using a specific formatting for this to work.
The problem with this .txt file formatting has been explained above as Translation Tools just can't take care of it because it is not a translation.

That is why this solution can't be used by our CMS.

TobsBobs - comment - 6 Aug 2018

> I extended the list of common words for English to the list proposed here: http://snowball.tartarus.org/. This site would also be the source for a bunch of other languages. To test and give more examples, I also modified the attached translation package for German with the list from that site. de-DE_joomla_lang_full_3.8.8v1.zip

What is this, @infograf768? A special translation package!

infograf768 - comment - 6 Aug 2018

> What is this, @infograf768? A special translation package!

LOL. Apologies. Did not see that.

Hackwar - comment - 6 Aug 2018

@TobsBobs Can you check whether the plugin is enabled?

TobsBobs - comment - 6 Aug 2018

@Hackwar The plugin is/was enabled.

Hackwar - comment - 21 Sep 2018

I checked this and there was an issue with the SQL table. The table was created with the collation utf8mb4_general_ci, which meant that non-ASCII characters were transliterated for index comparisons. As a result, the words "das" and "daß" had the same index entry, violating the unique constraint. Anyway, this should work now. Please test this PR.
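
A rough way to picture the collision: the key function below is a crude stand-in for how utf8mb4_general_ci compares strings (case-insensitive, with ß comparing equal to s), not MySQL's actual algorithm. The function names are hypothetical, and the thread does not say which collation the fix uses.

```python
def general_ci_key(word):
    # Crude approximation: ignore case and treat "ß" like "s".
    return word.lower().replace("ß", "s")


def insert_unique(index, word, key=general_ci_key):
    """Insert word unless an entry with the same comparison key exists."""
    k = key(word)
    if k in index:
        return False  # would violate the unique constraint
    index[k] = word
    return True


index = {}
print(insert_unique(index, "das"))  # True
print(insert_unique(index, "daß"))  # False: same key as "das"

# With an exact (binary-like) comparison both words fit.
exact = {}
print(insert_unique(exact, "das", key=str))  # True
print(insert_unique(exact, "daß", key=str))  # True
```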

TobsBobs - comment - 21 Sep 2018

I will test this PR on Sunday.

infograf768 - comment - 22 Sep 2018

We still can't add such a file through Crowdin...

Hackwar - comment - 22 Sep 2018

Then find a different way. Sorry, but it sounds like Crowdin can only translate rows one-to-one, and then there is no way that we can handle such a list of common words through Crowdin. Just the fact that the English word "the" could be "der, die, das" in German prevents that. Seems like Crowdin is a pretty crappy tool...

TobsBobs - comment - 23 Sep 2018

I cannot apply this PR. I think this is because this branch is out-of-date with the base branch.

Hackwar - comment - 23 Sep 2018

@TobsBobs please note that the pulltester does not work with this PR. The out-of-date message has nothing to do with this. I'll still update it to the latest version.

TobsBobs - comment - 23 Sep 2018

I know this. I wanted to install the patch like this:

curl -L -o 20781.diff https://patch-diff.githubusercontent.com/raw/joomla/joomla-cms/pull/20781.diff
git apply 20781.diff

I do not know a simpler way.

Hackwar - comment - 1 Feb 2019

@wilsonge Can you have a look at this and decide which way to go?

wilsonge - comment - 14 Mar 2019

Closing this right now as it's been merged into the code base as part of another PR by accident. Discussions are ongoing as to how to resolve the issues presented here.

wilsonge - change - 14 Mar 2019
Status: Pending -> Closed
Closed_Date: 0000-00-00 00:00:00 -> 2019-03-14 11:21:39
Closed_By: wilsonge
wilsonge - close - 14 Mar 2019
