Joomla! Issue Tracker | Joomla! CMS #20391 - [4.0] Using language specific tokeniser and stemmer for com

? ? ? ?

Fixed in Code Base
20 Jun 2018
Medium
Build: 4.0-dev
# 20391
Diff
Hackwar:j4finder_language

Pending

Pending Hound Hound is busy reviewing changes... Details

User tests: Successful: Unsuccessful:

Hackwar
14 May 2018

Tokenisation

In order to build the index, com_finder has to tokenise the input from a string into single terms. Right now com_finder here basically only really supports latin based languages and chinese (by a hack). Languages that do not seperate words by white space are not supported. This PR introduces a FinderIndexerLanguage class, which tokenises strings and which can be overridden by language specific tokenisers, so that all languages can be supported. The different tokenisation scripts would have to be supported by the translation teams.

Stemming

com_finder also supports stemming words in order to improve the search results. Stemming reduces a term to its base and thus allows for searching both singular and plural variants of a word with the same search term or different timecases of a word. Unfortunately com_finder up till now only had 3 stemmers, which supported only a small range of languages. Again, this has been outsourced into a file in the language pack.

Providing language specific tokeniser and stemmer

Translation teams would have to ship a {language}.finder.php file which extends from FinderIndexerLanguage. See the en-GB language pack for an example of adding your own stemmer.

How to test

Unfortunately this is a pretty opaque change as long as there is only the en-GB implementation of the language support file. Simply said, you should be able to take a current 4.0-dev, test indexing and searching, apply these changes and do the same tests, seeing that you got the same results.

fe22e20 14 May 2018

Implementing language specific tokeniser and stemmer for com_finder

d95ee89 14 May 2018

Fixing comments

Hackwar - open - 14 May 2018

Hackwar - change - 14 May 2018

Status

New

⇒

Pending

joomla-cms-bot - change - 14 May 2018

Category

⇒

Administration com_finder Front End Language & Strings

Hackwar - comment - 14 May 2018

Should we rename methods named "tokenize" to "tokenise" while we are at it?

2f10b16 14 May 2018

Fixing codestyle

Hackwar - change - 14 May 2018

Labels

Added: ? ?

infograf768 - comment - 14 May 2018

Translation teams would have to ship a {language}.finder.php file which extends from FinderIndexerLanguage. See the en-GB language pack for an example of adding your own stemmer.

Where is that en-GB example? What will happen when a language has no language.finder.php

What happens when a site has a mix of languages in its contents? or even is pure multilingual?

Hackwar - comment - 14 May 2018

The en-GB example is in /language/en-GB/en-GB.finder.php and translation teams would have to add their version of that file in that same place in their language folder in their language.

When a language does not have such a file, it will fall back on the latin based tokenisation that we are using right now. So tokenisation only has to be overriden for languages that will not work with the white-space-based tokenisation. Stemming is language specific and if not implemented, will only return the original term instead. So translation teams don't HAVE to provide such a file, but their language would benefit from this.

If a site has several content items with different languages, it will choose the right tokeniser and stemmer based on the language of that content item. If people mix languages in a content item that has been marked with a specific language, it will use that tokeniser (which normally shouldn't be an issue) and that stemmer. (What happens then would be dependent on the stemmer implementation. Normally that isn't a problem. In worst case you don't get stemmed words and thus can only search for the specific words.) If a content item is marked as '*' (=All), the white-space-tokenisation is used and stemming is disabled.

infograf768 - comment - 14 May 2018

I hope you are also going to propose the stemmers as I doubt any TT would be able to code this. I include myself in that group.

Hackwar - comment - 14 May 2018

Here is a good implementation of stemmers for several languages: https://github.com/wamania/php-stemmer
That should cover all languages that we are supporting today and a few more. There are more stemmers out there that can be used.

brianteeman - comment - 14 May 2018

if i read the changes correctly then this PR removes existing stemmers eg for french

Bakual - comment - 14 May 2018

I see two issues:
First is like JM said that no TT will be able to write that, even if example code is there. TTs will not understand code which hardly any Joomla contributor understands ?
Second is that you introduce a new PHP file in the language directory. I would rather love to see us getting rid of PHP files and use different technics if somehow possible. Because translation tools in general struggle to deal with that.
In any case, it would imho make more sense to move the function into the existing localise.php file and use existing workflows to deal with it.

The idea of this PR is certainly good and valid, but the proposal in its current state isn't going to work out well in practice simply due to translaters not being coders.

Hackwar - comment - 14 May 2018

@brianteeman Yes, this removes the old stemmers, since the translation packages need to provide those. Up to this PR, we effectively only supported english and french, with this PR these at least can be extended for every language out there.

@Bakual The stemmer (and in parts the tokeniser) are depending on the respective language and unless we want to go the same way like we did in the past and only support X languages, this has to be made modular and be installed by default when users install a language.

Regarding the coding of a stemmer: Maybe translators will have issues with coding that themselfs, but in that case I would expect us to support them in that and maybe provide those files for them. As I said, there are stemmers for most languages out there. I'd be happy to provide translators with the necessary code for at least all the stemmers that the project that I linked above provides. That would bring us from 2 to 12 supported languages. Also keep in mind that stemming is not a requirement, but an optional feature.

Stemmer and tokeniser for a specific component don't have any business in a global localisation file. (Yes, we should remove the methods that are com_search specific from those localise.php files.) If you have a better solution where to put the code, then I'm all ears, but it has to be part of the translation/localisation package, otherwise the localisation simply wouldn't be complete. (And again, this file is not mandatory, but an optional feature.)

Bakual - comment - 14 May 2018

Maybe translators will have issues with coding that themselfs, but in that case I would expect us to support them in that and maybe provide those files for them.

If we (read "You") have to write them anyway, we can as well just put them into core. Same result in the end.
The feature is nice, but a feature that never gets used because nobody will be able to use it, is useless.

this file is not mandatory, but an optional feature

Theoretically yes. But the file will be on Crowdin and a language will only ever be 100% translated if that file is also "translated". And it has to be working code or the language pack will be broken.
So in practice it will be a mandatory file for translation which nobody is able to "translate".

Again, I think the idea is nice and in theory everything is fine. It's the practice where it will fail.

infograf768 - comment - 14 May 2018

As I said, there are stemmers for most languages out there.

I hardly can consider most languages there in https://github.com/wamania/php-stemmer#languages (12 only as of today). Only one using non-latin alphabet.
Looks also that the code there would have to be modified to fit (as far as I can see by comparing with the en-GB.finder.php included here).
As for white-space-tokenisation I see no language there who do not use the white space and none at all using multibyte spaces. What happens with Chinese for example?

I certainly would love it if this was practically feasible for the majority of J languages and, as @Bakual rightfully says, independent from lang packs.

brianteeman - comment - 14 May 2018

although to be fair the current codebase doesnt support those languages either

Hackwar - comment - 14 May 2018

Sorry, I don't get you guys. As brian wrote, right now we don't support none-white-space languages and in terms of stemming, we only support english and french and there is no way to fix that. Now I was so foolish to provide a solution that allows translation packs to provide the best (optional) solution for their language for both situations and you are shooting this down because I'm not providing ready-made solutions for all 74 languages that we have translation teams for out there.
If you are worried about not having the nice "100% complete" behind the language, then people can simply put an empty class in the file if you really need that file to get to a 100%.
Yes, that one project "only" supports 12 stemmers. There are more projects out there than that one. And again: If there is no stemming algorithm out there for a language, then there is nothing that we can do. But right now I really would like to be able to use com_finder with stemming on a german, spanish or russian website.

Bakual - comment - 14 May 2018

I'm not shooting it down. Again I think it's a valid idea and I like the idea. I'm just saying that in its current form it will not work out as expected because it will not work in practice.

because I'm not providing ready-made solutions for all 74 languages that we have translation teams for out there.

That point was to make it clear the current proposal wasn't going to work. Because I'm absolutely clear that you will not provide this, And Michael will also not do that. And then we've run out of developers who understand the stemmer.

If you are worried about not having the nice "100% complete" behind the language, then people can simply put an empty class in the file if you really need that file to get to a 100%.

It's not about reaching an arbitary label. It's important to have a language at 100%. If we ever want to automate something around language packs, this will trigger it (currently it already triggers the building of the pack).
Having to create an empty file so our prcoess works sounds quite stupid. And even creating that empty file is not that easy in Crowdin.

I'm not trying to shoot it down. I'm trying to get it improved somehow. I just don't know how yet.

infograf768 - comment - 14 May 2018

@Hackwar

If a content item is marked as '*' (=All), the white-space-tokenisation is used and stemming is disabled.

I do not understand this. As you should know, when multilingual is not implemented in 4.0, then all content items are marked as ‘*’.

Does that mean that the stemmers would only work when the site is muultilingual?
I am confused

ReLater - comment - 14 May 2018

There are so many changes in J4 where I say only some isolated programmers and experts understand them and that others (also extension providers!) can*t adapt/use them in the future without learning/understanding lots of completely new stuff ... ... or have to quit ... ... or have to fall back into Joomla3 times and techniques.

Thus, for me arguments like "poor supporters won't understand how to" are not really comprehensible here.

BTW: Nearly nobody understood the stemmer files before this pr, too. Not to mention how to integrate them. That's the reason why we only have 2 language specific stemmers for years now in Joomla ;-)

ggppdk - comment - 15 May 2018

This PR looks like a major improvement, that will be usable not only in com_finder but also elsewhere (by 3rd party too !!)

How does the current solution make it any "easier" for translation teams to provide a word stemmer ?
Also what would an "easier" solution be ?

Stemmers need to be coded, right ? there will never be an easy solution
So i don't understand the "difficult" argument

Also why there is need to involve the translation teams ?
12 language packs will just call the 3rd party code: e.g.
https://github.com/wamania/php-stemmer
which we should include in core libraries

And any other language packs that want this ... will need a programmer to add it,
and then the translation teams will never touch it, as they will be told that it is file that can only be done by a programmer

because stemmer (and tokeniser), are not like language strings that are frequently updated

Bakual - comment - 15 May 2018

And any other language packs that want this ... will need a programmer to add it,
and then the translation teams will never touch it, as they will be told that it is file that can only be done by a programmer

It's one thing to write the stemmer. Assuming the team finds a developer who can write it (doubtfully), we still can't integrate the file into Crowdin. Which means our workflow (as decided by OSM) will be broken. We would need to host that file somewhere on GitHub and have an own build process for the language packs, while currently the packs are created by Crowdin. If someone wants to write a custom solution for this, we can talk further. Or maybe someone else has an idea how we can include an optional code (pseudo-translation) file in Crowdin.

infograf768 - comment - 15 May 2018

@Bakual
as far as I understood, @ggppdk is not suggesting that the xx-XX.finder.php be part of the language packs, but available in a library. Therefore Crowdin would be out of the scope.
I agree with that kind of solution, but still waiting for Hannes to explain how precisely this is supposed to work with multilingual/monolingual sites.

Hackwar - comment - 20 May 2018

@infograf768 So far there is no plan/defined behavior for non-multilanguage sites. Right now (regardless of this change or not), Finder/Smart Search reads the language of the content item and nothing else. If the language is not set or is *, then it wont stem the word. Finder/Smart Search does not care if the site is set to be mono- or multi-lingual. That would be yet another (indeed necessary) change for this component, but mostly irrelevant to this PR.

Again, as I wrote before, it is nice that you bring up all the issues in Finder/Smart Search, but this PR is not meant to fix all of those, but concentrates on making tokenising and stemming possible for every language out there. I'm very much willing to work on this component and its plugins more, but I refuse to blow up the scope of the current PRs even further, just because you now notice that Finder/Smart Search has lots of issues. The changes that I'm proposing are only possible now with Joomla 4.0, otherwise we have to wait till Joomla 5.0 and that would simply be to long. So if we please could look at the current PRs, decide if they do what I claim they do, decide if that is an improvement or not, then test if they do their work correctly and then merge those, then I can get back to adressing the other issues of Finder/Smart Search. But if these PRs stay in limbo for the next year, then get closed because they are outdated and/or 4.0 is released, then we would loose a valuable opportunity to actually improve Joomla here. The longterm goal for 4.0 for me is to get rid of com_search, make com_finder useable and make it useable for all languages. That wont happen if PRs get put on hold because people noticed further issues which are out of scope of the original PR... </rant>

Hackwar - comment - 21 May 2018

Just to list all current PRs for Finder/Smart Search:

#20185 [4.0] Removing artificial sharding from com_finder
#20384 [4.0] Removing terms tuples from com_finder
#20391 [4.0] Using language specific tokeniser and stemmer for com_finder

infograf768 - comment - 21 May 2018

Honestly, if the stemmers are not used at all for monolanguage sites (through a parameter at least), this is totally useless.
In any case, I stand by the fact that these new stemmers do NOT need to be included in the language packs.
One goes with the other.

ggppdk - comment - 21 May 2018

Why not to use stemmer in monolanguage site ? can someone explain why not to use ?

It is sane to assume that the language of article is such website is

$lang = JComponentHelper::getParams('com_languages')->get('site', 'en-GB');

I mean why would anyone have a different UI language than the language of articles in a monolanguage site ?

brianteeman - comment - 21 May 2018

I mean why would anyone have a different UI language than the language of articles in a monolanguage site ?

Lots of reasons

franz-wohlkoenig - comment - 21 May 2018

I mean why would anyone have a different UI language than the language of articles in a monolanguage site ?

For Example:

English Words are often shorter than German so th UI is smarter,
German Translation takes a while
make Screenshots for Tests using English UI is better for others

chrisdavenport - comment - 21 May 2018

The stemmer should certainly work on mono-lingual sites. I've definitely set up mono-lingual English sites with the porter stemmer enabled and working in the past, without having to specify the language on each article. So if it's not working now I'd consider it a bug.

Hackwar - comment - 21 May 2018

I'm pretty sure that no one is saying that a monolingual website should not use the stemmer. I am just saying that that is a different bug than what I'm trying to fix here and I don't want to delay THIS PR with a lengthy discussion on how to fix that other bug. It is another bug (and it is a bug indeed), but that should be fixed in another PR.

infograf768 - comment - 21 May 2018

It is another bug (and it is a bug indeed), but that should be fixed in another PR.

Wait a minute! We are not totally idiots here...
What you are describing for this PR is wrong, at the same time for the implementation (specific files per language pack) and not yet designed for monolingual and you want this PR to be merged ?

infograf768 - comment - 21 May 2018

Let's not forget!!!
#18972

Hackwar - comment - 21 May 2018

I will stop responding to this thread until a maintainer has reviewed this and given a decision on how to proceed with the feature in THIS pull requrest.

Bakual - comment - 22 May 2018

@Hackwar Thinking about this for a while now. What do you think of allowing language specific stemmers being installed as an own extension. Eg as a plugin of some sort or a library file extension. That way it is not part of the regular language pack (which we can't handle), but it could be packaged into one or installed separately.

csthomas - comment - 22 May 2018

... What do you think of allowing language specific stemmers being installed as an own extension. Eg as a plugin ...

This should be a good idea, for example a plugin in the finder folder that will register itself in the finder component.

Hackwar - comment - 23 May 2018

#20562 implements stemming for mono-lingual sites.

infograf768 - comment - 28 May 2018

For the sake of it I nevertheless wanted to test this (to see if it would be simple to just put the language stemmers where they could be, i.e. in this case administrator/comonents/com_finder/helpers/indexer/stemmer/language/en-GB/en-GB.finder.php by modifying the path in language.php)

Therefore I just tested the PR on a 4-0dev multilingual test site without any change.
I got a js error when indexing

ReferenceError: assignment to undeclared variable json[Learn More]
indexer.min.js:1:1297
m
http://localhost:8888/installmulti/joomla40/media/com_finder/js/indexer.min.js:1:1297
onError
http://localhost:8888/installmulti/joomla40/media/com_finder/js/indexer.min.js:1:434
e.request/n.onreadystatechange
http://localhost:8888/installmulti/joomla40/media/system/js/core.min.js:1:7777

After waiting for a while, I closed the modal and this is what I got:

Only categories and only one of my en-GB categories. No French categories. No articles... etc.

Hackwar - comment - 28 May 2018

I checked this and among all my changes, I missed that this PR does depend on the database changes of #20185. It would also benefit if we could get #20561 merged. The issue is, that the language columns of the finder tables are not consistent and often enough to short for the full language tag. With #20185 I normalised all language columns to the standard that we have everywhere else. #20561 goes a step further and normalises the rest of the columns, too. So either we have to first merge #20185 or you have to pull that change into your current branch, too.

infograf768 - comment - 28 May 2018

Will test on clean install, but without #20561 as it is waiting for a correction.

infograf768 - comment - 28 May 2018

Impossible to merge both #20185 and this #20391 as there is a conflict for SearchModel.php
Will try Indexing only

laoneo - comment - 28 May 2018

I mean if we use the stemmer as library (guess plugin would be too much as they have no options, shouldn't be disabled and have no access levels), it doesn't mean that when a new language is installed through the back end, that the stemmer library is not installed as part of the process. Wer can extend that process to also install the stemmers form a different source.

infograf768 - comment - 28 May 2018

Indexing looks like working
After correcting SearchModel.php, I get very weird results.
Basically, it looks like the absence of a stemmer for a language breaks the suggestions.

On a pure 4.0-dev site using default snowball, I get

Suggestions do contain indeed some stuff not related to the French items but it also contains the text I am looking for.

With both your PRs

On a new clean install with #20185 and this #20391 I get for a search in French

or

(Where the hell does this come from?)

When I succeed to search i.e. keeping focus in the field without using possible suggestions, the word found is not highlighted...

And in English

Concerning stemmers, if this works, I changed my mind. They should be placed in
/media/com_finder/language/xx-XX/xx-XX.finder.php

I am afraid (even when suggestions will be corrected somehow as the search field is totally instable) that we will be in a worse situation than before (in 3.x) as a language without stemmer does not propose anything while snowball was not perfect but did work for latin languages with spaces.
We can't merge this half baked PR without
A. Modifying the path to the stemmers and deciding to include as they come all of them in core
B. Being able to test truely what we get as suggestions for various languages, with and without specific stemmers, reusing snowball when no stemmers available.

infograf768 - comment - 28 May 2018

Some bugs stated above may also come from the fact that multilingual is broken in frontend.

If we modify languagefilter a bit
$router->attachBuildRule(array($this, 'buildRule'), Router::PROCESS_BEFORE);
(which does not solve all, see #19997 (comment) ), we can get this

Hackwar - comment - 28 May 2018

Snowball does not work in 4.0, since Snowball is not supported anymore in PHP 7.x. Right now, Joomla 4.0 only supports english and french. There is also no way that we can ship all the possible stemmers out there with Joomla. Not all stemmers are available in PHP, not all languages can be stemmed and quite frankly, I refuse to go ahead and do all the research for at least the 72 languages that we support right now to find if stemming for them is possible, if they have a stemmer in PHP, adapting that to the interface that we provide and testing the implementation if it actually does what it does.

If your answer to this PR is "We need to support all the languages out there out of the box!!!11", then my response is "Ok, then we will be stuck forever with english and french."

Issues of suggest are not related to this.

infograf768 - comment - 28 May 2018

Look, we can't really test anything if multilingual is not working.
Also, we can't test anything if we can't compare with multiple stemmers.

Therefore, I suggest

Solve partly the multilingual issue (here is a small patch to change at least the most important error in languagefilter : let's get this merged fast #20599
Change the path for the stemmers to $path = JPATH_ROOT . '/media/com_finder/language/' . $language . '/' . $language . '.finder.php';
Please, as you seem to be the only one able to create a stemmer, create 2 other stemmers that we can test: for de-DE and fr-FR and place them in media with the moved en-GB one

@wilsonge
You asked for a practical proposal. Here it is. Let's get as many stemmers as possible in core.
If we get a new stemmer between releases for a specific existing language pack, we will help the TT to install it in media with a new version of the pack. That is extremely simple with media destination=
But evidently we first have to test fully this proposal.

Hackwar - comment - 28 May 2018

/media is not the right place for this. PHP files have no business in the /media folder, simply because it is code and not media.

I desperately hope that I'm not the only one in this project that can take an existing class and adapt it to our interface. If I'm wrong, this project is going to fail soon and fast...

Stemmers are still logically a part of the language packs, but what do I know...

I will provide stemmers for french and german soon.

Bakual - comment - 28 May 2018

Stemmers are still logically a part of the language packs, but what do I know...

In theory, the language pack is the right place, yes. In practice, the language folder is the wrong place. Simply because the language folder ideally should have no PHP code as you can't "translate" PHP code.
If the stemmers are made an own extension (eg file/library type), then it can still be bundled with a language pack, but developed in a different place (eg GitHub) and not in Crowdin.
But I'm going to repeat me here.

mbabker - comment - 28 May 2018

Simply because the language folder ideally should have no PHP code as you can't "translate" PHP code.

Depends what you consider the language folder as. If it's simply a collection of INI files to hold language string translations, then sure. The reality is the language system is more than translations and the language package is the most appropriate place for these additional tools/configurations (i.e. all the metadata that goes into the XML files to be parsed by the installer or the language library). So, are the translation teams solely responsible for string translation or are they managers for their language/locale as it relates to all aspects of the language system? If it's the former, then we really need to come up with a solution for how to address the missing pieces. If it's the latter, we need to improve our workflows and tools to ensure everyone can do everything that needs to be done.

Yes, I get that a lot of translators are not coders and can't maintain code solutions (i.e. the stemmers), but the translation teams also shouldn't be one person shops where the translator has to do everything.

brianteeman - comment - 28 May 2018

Good luck on making changes

infograf768 - comment - 29 May 2018

So, are the translation teams solely responsible for string translation or are they managers for their language/locale as it relates to all aspects of the language system?

Evidently the former: 95% translate and do not code at all. This is why we got so successful in getting so many languages since the creation of Joomla. I am astonished that anyone would even ask this question in 2018.
In any case, crowdin can't handle these stemmers, whoever codes the few of them we may get (and I challenge any coder not aware of the specificy of these to produce a working piece).

Listening to @Bakual and I suggestions would prove efficient and practical instead of dreaming about every translator (team when there is one) being also a coder AND a volunteer concerning language matters in Joomla...

infograf768 - comment - 29 May 2018

/media is not the right place for this. PHP files have no business in the /media folder, simply because it is code and not media.

I guess js IS code and CSS is not media but what would I know. The place is just easy to use if a TT who gets a new stemmer between releases wants to create a custom update pack to also install the stemmer using <media destination which in any case would be a rare situation.

mbabker - comment - 29 May 2018

Evidently the former: 95% translate and do not code at all. This is why we got so successful in getting so many languages since the creation of Joomla. I am astonished that anyone would even ask this question in 2018.

I ask because that sets the expectation for what we can expect the formally recognized translation teams to do. If it is ONLY translation of INI files then so be it, but at least make sure this is clear.

dreaming about every translator (team when there is one) being also a coder

Not all of us have this dream. I know full well where my skillset ends and what things I should and should not do. I also have the expectation that others communicate this information too. I don't expect anyone here to be able to do all the things on their own (even if it feels like I try to because nobody else will do something), but I do expect people to be able to work together to come up with solutions.

Joomla's language system is much more than being able to offer translations of text strings. Look at how much can be configured and controlled through the Localise class or the XML metadata schema for a language. Having the search system be able to properly support more than 2 languages is yet another continuation of using the language system to extend core's capabilities. If that means we bloat core and put stemmers for every language possible in the package because it's too much to ask for PHP code to be maintained outside core, so be it. If that means we find a way to include the code in language packs, so be it. But any proposal regarding improving the language system as a whole shouldn't be shot down simply because it requires more than the ability to translate from English to a preferred language, or we start toning down the language system to match the capabilities of the teams who are supporting that system (basically start dropping all capabilities that don't relate to string translation); I'm pretty sure nobody wants that one to happen.

Our only limitations are on what we will allow ourselves to do. IMO we keep making the wrong decisions as a project that limit what our software can do (much more outside this subject than this particular issue) and sooner or later that has to stop.

infograf768 - comment - 29 May 2018

@mbabker
Nobody here is shooting down improvements.
We just have to deal with the volunteers we have, their capacity to do some stuff and the tools at our disposal. These are our limitations. Placing these stemmers in core (and not in the lang packs) will NOT shoot down anything, but going on and on accusations towards people who just explain why we have to choose this as best solution will certainly push people to stop volunteering.

mbabker - comment - 29 May 2018

And that's why I started with the question of what the scope of the translation teams is. Feasibly, if we can ONLY expect of them to produce the INI and XML files found in their respective language's directory(s) then that's fine, but let's make that clear from the get go. That has NOT always been clear, even if it is for all intents and purposes the status quo.

So, are the translation teams solely responsible for string translation or are they managers for their language/locale as it relates to all aspects of the language system? If it's the former, then we really need to come up with a solution for how to address the missing pieces. If it's the latter, we need to improve our workflows and tools to ensure everyone can do everything that needs to be done.

Based on the commentary here the scope of work for a translation team and its translators is the first part of that question, they are responsible for string translation. So to me that means the distribution mechanism for stemmers is independent of language packages, full stop. They can be introduced as library or plugin extensions for all I care at this point, but there is no need to either introduce them into the language package or come up with some solution that puts PHP code in the media directory.

brianteeman - comment - 29 May 2018

Does that mean that we also have to remove xx_XX.localise.php ?

Hackwar - comment - 29 May 2018

Since that file only contains stuff from com_search (with one exception) and we are going to delete com_search (right?) That would be a yes

mbabker - comment - 29 May 2018

Well, not quite that easy. It supports transliteration and defining the plural suffix rules for language keys, so you'd need a way to do those two things without the PHP class, THEN we can remove that.

Hackwar - comment - 29 May 2018

That's what I meant with the exception.

infograf768 - comment - 29 May 2018

the localise.php file is used for other stuff too: it for example lets Persian language display Jalali dates.
I would suggest anyone who wants to delete anything to first think about the consequences...

brianteeman - comment - 29 May 2018

who said anything about deleting anything

ot2sen - comment - 29 May 2018

@Hackwar Thank you. Have been following this for the last two weeks and as @ggppdk says some 50 replies above, this is a major improvement.

Stemming and lemmatization are difficult topics and do need some linguistic insight. Where the latter gives better precision the former will be easier to implement. What confuses me is that the skills of the devoted Translation Teams should be the blocker. A quick look at the list of local coordinators and their teams tells that +20 teams will be more than capable of handling the programming part of this challenge.
(Whether they have the linguistic skills at the same level to verify the external stemming sources is hard to tell when not communicating directly with the teams).

Q: Have the Translation Teams been asked if they are up for the challenge?

If it gives trouble with the workflow, then perhaps turn to external sources and consider if using an own library bringing in powers of Hunspell and more filters would strengthen i18n of core and at the same time allow to not be dependent of the not defined skills of contributing Translation Teams.
Looking at elastic search docs https://www.elastic.co/guide/en/elasticsearch/guide/current/choosing-a-stemmer.html and at how others make use of APIs of Hunspell and filter https://developer.s24.com/blog/german_stemming_like_a_pro.html this whole stemming improvement might better be off the teams handling interface translation.

infograf768 - comment - 30 May 2018

A quick look at the list of local coordinators and their teams tells that +20 teams will be more than capable of handling the programming part of this challenge.

A very quick look indeed. I have talked to some experienced coders who don't understand anything to the stemmers. And, in any case, even if we have a few TTs who could do it, the practicality of adding them in the lang packs is what is at stake here (crowdin).
Let's not repeat the errors of the past about the so-called TT workgoup please.

As @Bakual , I and finally @mbabker wrote, these stemmers (by who ever codes them...) should be placed elsewhere in core.

Waiting now for @Hackwar to modify this PR after adding the 2 new ones.

Hackwar - comment - 30 May 2018

A very quick look indeed. I have talked to some experienced coders who don't understand anything to the stemmers.

Could you PLEASE, PLEASE stop acting as if translators and/or coders had to write their own stemmer? No one is asking that anyone in Joomla is getting a degree in linguistics, write several doctors thesises about stemmers and then create stemmers from the ground up. All I'm asking is, that people take existing stemmers and either create a file like this:

class En_GBLocalise extends LanguageBase
{
	protected $stemmer = null;

	public function __construct()
	{
		require_once JPATH_SITE . '/libraries/php-stemmer/src/English.php';

		$this->stemmer = new \Wamania\Snowball\English;
	}

	public function stem($word)
	{
		return $this->stemmer->stem($word);
	}
}

or copy the existing stemmer class and change the class name to work with our system. No one is expecting any coding work that is beyond that of highschool level of coding. The way you describe our translators, you make them out to be total idiots and I refuse to think of our them as such. Or any of our developers for that matter.
Yes, stemmers are freakingly complex, but that is what we have the resources of researchers for. They did all the researching work and the topic is nothing new (the porter stemmer algorithm if from the '80s) and thus most stemmers possible have been implemented in PHP, sometimes several dozen times. Here are all 13 stemmers that seem to be possible with the porter stemmer: [http://snowball.tartarus.org/]
No one is asking them to invent their own stemmer, as much as no one is asking every Joomla user to write their own PHP runtime.

Bakual - comment - 30 May 2018

@Hackwar It's still not possible to do that in Crowdin. So we need to place it elsewhere, not in the language folder.
PHP Code should be handled in a place that is suited for the task, GitHub is the best choice for us.
If the stemmer is placed as an own extension, then it's reasonable easy to include that extension in a language pack (into the package, not the actual language extension). Who then writes that stemmer class, I don't mind. If a TT is capable of doing it, fine. Let them do it on GitHub. In Crowdin even the best coder will not be able to do it, trust me on this.

Of course, we can also go the way JM suggests and just add additional stemmers to core whenever someone writes the stemmer class for a language. It's not like they will change often and thus need to be able to be updated independant from core (like language packs).

Imho best way is to have both. Add stemmers to core for as many languages as we can, and add an additional lookup folder which can be used by 3rd party "stemmer" extensions (be it a plugin or file extension I don't mind).

Again, the way this PR is set up with having the stemmers in the language folder will not work for reasons already laid out multiple times.

@mbabker Regarding your question what should be in a language folder: Given that Crowdin was chosen as our official translation tool by our management, we just shouldn't add more stuff to the folder that can't be managed in the chosen tool.
If we could properly manage PHP code in Crowdin, it would be no problem having it in the language folder. Because coders could jump in and help. But the way it is, even the localise.php file is a pitty to manage.
So yes, I would love to get rid of the localise.php file, but I don't see a way yet. Some stuff could be moved to an XML file easily, but others are harder to do.

brianteeman - comment - 30 May 2018

@Hackwar It's still not possible to do that in Crowdin. So we need to place it elsewhere, not in the language folder.

So how do we deal with the existing php file that is in the language pack?

infograf768 - comment - 30 May 2018

There is a big difference between a simple file on the model of localise.php as described above by @Hackwar which indeed TTs can handle, (and crowdin could to, specially if it is exactly the same code to just modify without adding or taking off) and the full stemmer. That file would need a name.

Here is what crowdin can do (grace to @Bakual)

If the localise.php is more complex, look at the French one which contains transliteration, or the Persian one with Jalali date, then we have to manually create the pack (or use com_localise)

Bakual - comment - 30 May 2018

@brianteeman Crowdin natively supports PHP files, but then only allows to translate things like variable values. You can't change code there.
For the localise.php we renamed it to localise.txt so Crowdin treats it as a text file. Now you can change the code, but only "sentence" by "sentence" or line by line (if no sentences are detected). When exported the file is renamed to the php extension again.
That works surprisingly well for simple classes, but as JM already wrote it already fails as soon as you have more complex ones like for French or Persian.
Main reason is that you can't add or remove lines.
So we have a partial solution (or better: a workaround) for most languages, but not all.

Now imagine that with class like the stemmer and you see why this can't work with Crowdin.

brianteeman - comment - 30 May 2018

Thank you for the detailed explanation

infograf768 - comment - 30 May 2018

Look here for a specific stemmer for Greek Language (Drupal is using almost the same)
https://github.com/magaras/greek_stemmer/blob/master/mod_stemmer.php

As you can see, not anyone (idiot or not) would be able to adapt that code to the way stemmers are coded elsewhere. for example
https://github.com/wamania/php-stemmer/blob/master/src/Russian.php
or here by @Hackwar based on the former porter-en stemmer we had previously in core and was apparently based on
https://github.com/camspiers/porter-stemmer/blob/master/src/Porter.php

class FinderIndexerLanguageen_GB extends FinderIndexerLanguage
{
	/**
	 * An internal cache of stemmed tokens.
	 *
	 * @var    array
	 * @since  2.5
	 */
	public $cache = array();

	/**
	 * Regex for matching a consonant.
	 *
	 * @var    string
	 * @since  2.5
	 */
	private static $regex_consonant = '(?:[bcdfghjklmnpqrstvwxz]|(?<=[aeiou])y|^y)';

	/**
	 * Regex for matching a vowel
	 *
	 * @var    string
	 * @since  2.5
	 */
	private static $regex_vowel = '(?:[aeiou]|(?<![aeiou])y)';

	/**
	 * Method to stem a token and return the root.
	 *
	 * @param   string  $token  The token to stem.
	 *
	 * @return  string  The root token.
	 *
	 * @since   2.5
	 */
	public function stem($token)
	{

[etc...]

These have the merit to exist in PHP.

joomla-cms-bot - change - 3 Jun 2018

Category

Administration com_finder Front End Language & Strings

⇒

Administration com_finder Front End External Library Composer Change Libraries

be0950d 3 Jun 2018

Merge branch '4.0-dev' of https://github.com/joomla/joomla-cms into j4finder_language

7101b9e 3 Jun 2018

Adding a bunch of stemmers to com_finder

Hackwar - change - 3 Jun 2018

Labels

Added: ?
Removed: ?

00fd497 3 Jun 2018

Fixing variable names in finder language stemmer files

Hackwar - comment - 3 Jun 2018

I've added several stemmers. Since the consensus seems to be that we have to provide them for all languages, I removed the stemmer from the language pack again and instead added it all to com_finder. I also added a new library with most of those stemmers. I'm wondering if we can remove the test files of php-stemmer in order to keep the size down...

I also worked on a greek stemmer, but due to time constraints, I can only link this gist here: https://gist.github.com/Hackwar/07d51d81899da96fc0d785c61ed3178f

72c5e99 3 Jun 2018

Codestyle

infograf768 - comment - 3 Jun 2018

You were suggesting above to add a method/class for each language in order to use the good stemmer automatically. I think it is a good idea as it could be treated like we do for en-GB.localise.php in crowdin.
That file would be added in the language pack when there is a stemmer for the language.
You chose to do it differently.

I suggest to move administrator/components/com_finder/helpers/indexer/language/en-GB.php to an en-GB.stem.php which would be in both site and admin en-GB folders.

Example of my reasons:
The en-US lang pack could also use the same stemmer and it would not need more than changing the class. In fact any en-XX could.
class FinderIndexerLanguageen_GB extends FinderIndexerLanguage
would be changed to
class FinderIndexerLanguageen_US extends FinderIndexerLanguage

For German, all packs (de-DE, de-AT, etc. would do the same for the German stemmer.
Etc.
Similar for fr-FR and fr-CA.

At looking at code, it does not look so complex to modify.

brianteeman - comment - 3 Jun 2018

Why in both the site and admin folders?

Hackwar - comment - 3 Jun 2018

You fiercely argued against adding PHP code to the language pack and now you are advocating putting duplicate files into even the same language pack?

See the chinese class on how to use this for chinese and taiwanese at the same time.

690b0dc 3 Jun 2018

Merge branch '4.0-dev' into j4finder_language

infograf768 - comment - 3 Jun 2018

Come on... there is a difference between a simple file we can deal with as we do for localise.php and the full stemmer.
In any case, the other simple solution is to add a new metadata in en-GB.xml, which would solve all.
stemmer>english</stemmer
as we already do for the calendar.

Bakual - comment - 3 Jun 2018

I'm wondering if we can remove the test files of php-stemmer in order to keep the size down...

Yep, you can remove them. Just add them to .gitignore so they don't get committed. We did that for other vendor files as well.

You fiercely argued against adding PHP code to the language pack and now you are advocating putting duplicate files into even the same language pack?

I wouldn't do that indeed. But instead of adding an "empty" class extending another, Maybe it could be nicer to add a setting to the language manifest XML. Eg stemmer="zh-CN" in the zh-TW language. The we wouldn't need a class for each dialect.

Bakual - comment - 3 Jun 2018

I haven't looked at the code in Detail. Did you remove the possibility of providing a stemmer as a 3rd party code completely?

mbabker - comment - 3 Jun 2018

I'm wondering if we can remove the test files of php-stemmer in order to keep the size down...

Yep, you can remove them. Just add them to .gitignore so they don't get committed. We did that for other vendor files as well.

Everything that's not the production code and the LICENSE file (so a package's composer.json, or test framework configs, or CI related configs, etc.) are all in .gitignore.

Hackwar - comment - 3 Jun 2018

Come on... there is a difference between a simple file we can deal with as we do for localise.php and the full stemmer.

That is the thing: These are the easy cases where I can simply use that third party library. if you look at the greek stemmer that I linked in the gist, you would see that all other stemmers (if there exist any) would be a lot more complex. Consider this rather the exception than the rule that the stemmer part of this file is so simple.

I wouldn't do that indeed. But instead of adding an "empty" class extending another, Maybe it could be nicer to add a setting to the language manifest XML. Eg stemmer="zh-CN" in the zh-TW language. The we wouldn't need a class for each dialect.

That would result in reverting a change that was in Smart Search earlier, where it only used the "Primary Language" to index the data. While that is indeed possible, that would make the language column inconsistent with the rest of Joomla. I'm very hesitant to do that...

I'll remove the test files.

9e4c799 3 Jun 2018

Removing test files of php-stemmer

joomla-cms-bot - change - 3 Jun 2018

Category

Administration com_finder Front End External Library Composer Change Libraries

⇒

Repository Administration com_finder Front End External Library Composer Change Libraries

4bb048d 3 Jun 2018

Merge branch 'j4finder_language' of github.com:Hackwar/joomla-cms into j4finder_language

Hackwar - comment - 3 Jun 2018

Test files are gone.

ggppdk - comment - 3 Jun 2018

This PR now has

// German language support class for the Finder indexer package.
class FinderIndexerLanguagede_DE extends FinderIndexerLanguage
{
...
	protected $stemmer = null;
...
	public function __construct()
	{
		$this->stemmer = new \Wamania\Snowball\German;
	}
...
	public function stem($token)
	{
		return $this->stemmer->stem($token);
	}
}

I thought we would have something like this inside de-DE.localise.php

	public function stem($token)
	{
		if ($this->stemmer === null)
		{
				// Also load stemmer class here (if not auto-loaded) and instatiate it
				$this->stemmer = new \Wamania\Snowball\German;
		}
		return $this->stemmer->stem($token);
	}

So that the language pack would

either us a stemmer include in joomla and just auto-load + instatiate it
or installs a 3rd party (as library or plugin) and then auto-load + instatiate it

How do you use a custom stemmer with this solution ?

I was only against duplicating the code of 3rd party stemmer e.g. wamania stemmer inside the language files and asking translators to maintain it

Hackwar - comment - 3 Jun 2018

That file is not only to stem words, but also to tokenise the text. If anybody cared to look at what I wrote or at the code, you would notice that for example a greek stemmer is not supported by php-stemmer and would need a specific implementation in whatever language file we would have.

Since apparently PHP code in the language packs is impossible, this is the next best solution. No one talked about duplicating any code (at least not more than necessary). And you can still provide 3rd party stemmers by simply including a properly named class. I will not implement special cases here and there.

So either accept that Joomla has to ship all stemmers out of the box or make them all external, but don't mix that up. The result would be what we had the last few years, where actually only 2 stemmers were included and 99.999% of all users out there didn't use a stemmer or com_finder at all.

Decide what you want: Stemmers as part of the language packs or "hardcoded" in com_finder. Or get someone else to do the work. I will not again invest 9 years fighting windmills for a feature like this.

Bakual - comment - 3 Jun 2018

That would result in reverting a change that was in Smart Search earlier, where it only used the "Primary Language" to index the data. While that is indeed possible, that would make the language column inconsistent with the rest of Joomla. I'm very hesitant to do that...

Not sure I understand that. I'm not saying to just looking for eg a de class for all the german based languages/dialects. My thought was that the language pack could specify in its XML if eg the de-DE one is used for de-CH so we don't have to have an empty class for de-CH. Or in your example where zh-TW and zh-CN share the same stemmer class.
Of course we could also go the route we do with the calendar, where we first look for the full language-COUNTRY code and if not present fall back to the language code. That way still allows to have dialect specific file if needed and if not takes the "main" file. That obviously fails if the same stemmer can also be used by different languages in case the languages are close enough. Do you know if that is the case? Then the XML approach would still work but the "calendar" approach would not.

Anyway, this is also something which can be done after this PR is merged. So nothing urgent.

infograf768 - comment - 4 Jun 2018

Anyway, this is also something which can be done after this PR is merged. So nothing urgent.

Not sure I agree with this. Let's rather make it work fine from the beginning as there is always the possibility that no one will take the time to understand the code as well as Hannes does and therefore we would be left, if I understand well, with adding specific files for each new language variant instead of re-using existing stemmers by a simple metadata in the xml.

The Greek case has to be taken care of specifically. I have not seen any other php stemmer at this stage on the Net. Therefore we would be safe for quite a long time.

Bakual - comment - 4 Jun 2018

Not sure I agree with this. Let's rather make it work fine from the beginning as there is always the possibility that no one will take the time to understand the code as well as Hannes does

It's just the way the stemmer class is loaded. You don't have to understand the stemmer class itself to modifiy that part ?

nikosdion - comment - 10 Jun 2018

The Greek stemmer in the Gist is unusable because of character conversion issues. It would appear that Drupal is using the old ISO-8859-7 character encoding for Greek, whereas Joomla has moved fully to UTF-8 for more than a decade. I am working on the Greek stemmer. Hopefully I'll be done today or tomorrow and I'll PR the changes to Hannes' branch.

58e58d9 10 Jun 2018

Fixing the Greek stemmer

866463f 10 Jun 2018

Fixing the Greek stemmer

f6379cc 10 Jun 2018

Fixing the Greek stemmer

6a9feca 10 Jun 2018

Fixing the Greek stemmer

Hackwar - comment - 10 Jun 2018

Thank you very much! ?

cfc9e74 10 Jun 2018

Fixing the Greek stemmer

3ec794d 10 Jun 2018

Fixing the Greek stemmer

e6a1da8 10 Jun 2018

Fixing the Greek stemmer

b994026 10 Jun 2018

Fixing the Greek stemmer

cdf44ab 12 Jun 2018

Merge pull request #23 from nikosdion/feature/j4finder_language

Hackwar - comment - 12 Jun 2018

I've merged in the greek stemmer for now. If you encounter any issues, please open a new PR. Thank you for that, @nikosdion!

d76986d 12 Jun 2018

Merge branch '4.0-dev' into j4finder_language

infograf768 - comment - 12 Jun 2018

To test Greek one has to install the 3.8.8 Greek language pack from Install ->Extensions
http://joomlacode.org/gf/download/frsrelease/20834/164487/el-GR_joomla_lang_full_3.8.8v3.zip

Hackwar - comment - 12 Jun 2018

Ok, so phpcs seems to not be UTF8 compatible... I modified that one regex to go across multiple lines and I checked that each line was shorter than 150 chars, but it now warns that each of those lines is more than 200 chars. Don't know yet what we can do here.

Hackwar - comment - 12 Jun 2018

I'm not sure if we simply have to add --encoding=utf-8 to the call of phpcs. From what I read, it should actually use utf8 by default, but still...

brianteeman - comment - 12 Jun 2018

The greek stemmer is GPL3 not 2 - someone better than me needs to decide if thats ok to be included in core as the licences are not compatible

Hackwar - comment - 12 Jun 2018

 * @license     GNU General Public License version 2 or later; see LICENSE.txt

That is in each of our files. Notice the or later. That should still be compatible.

brianteeman - comment - 12 Jun 2018

My understanding, and I could be wrong, is that by adding a gpl3 file into the core will mean that the core becomes gpl3. iirc thats why @nikosdion had to change fof to gpl2

nikosdion - comment - 12 Jun 2018

This is correct. GPLv3 is more restrictive than GPLv2 in that it has the patent clause. Therefore mixing GPLv3 with GPLv2 means that the whole lot is GPLv3. There's a reason I have been telling everyone that Joomla! needs to move to GPLv3 since 2011. When we had the Joomla! 4 kickoff sprint in Denmark in July 2015 I also raised the same concern and I was told by Ronni that he'd ask the lawyers but probably there was no problem doing that. Three years later I guess nobody asked no lawyer and we're still kicking the same can down the same road.

It would be utterly disappointing to the Greek community for Joomla! 4 not to support Greek in Finder because someone didn't pick up the darned phone and / or had the balls to upgrade the license. More so when the consensus is that that the old search will eventually be going away. Without Greek support in Finder and no other search option Joomla! would be a really hard sell in Greece, especially in the public sector where it's being used avidly (and holding back the tide of Joomla-to-WordPress conversion that had started taking place in this country since 2014).

Or we'll end up packing up the Greek stemmer as a Greek community maintained package, licensed under GPLv3, tainiting the core once you install it. This is not a big problem. There are so many 3PD extensions already GPLv3 licensed which use or extend Joomla! that it's effectively unavoidable that your real world Joomla! installation is de facto tainted GPLv3. That is to say, sticking with GPLv2+ in the core is kinda daft.

ot2sen - comment - 12 Jun 2018

With the "and later" we already have then the path to GPLv3 shouldn´t be problematic.
They are compatible upwards. See also the note 3 below the chart
https://www.gnu.org/licenses/gpl-faq.html#AllCompatibility

brianteeman - comment - 12 Jun 2018

sorry to have been the bearer of that news. for the record my understanding would be that there is no issue changing it to gpl3 but ianal

nikosdion - comment - 12 Jun 2018

I agree with all of you :) That's the point I have been making for seven years. All it takes is someone from OSM picking up the phone and asking the lawyers. If anyone of you can navigate the OSM quasi-corporate structure and talk to someone who can actually do that it'd be just plain amazing. I obviously cannot for reasons which are best left unspoken, especially in a public forum.

dgrammatiko - comment - 12 Jun 2018

Dumb question: why the stemmer is not a PR against the wamania (whatever is the name) library? That could fix (?) the license incompatibilities...

nikosdion - comment - 12 Jun 2018

You cannot submit code licensed GPLv3 code into a library licensed MIT. You can use MIT code with GPLv3 code but the end result is GPLv3, not MIT.

brianteeman - comment - 12 Jun 2018

Note that drupal is in the exact same position in being licences gpl2 or later (the file referenced above is not part of drupal just written for drupal)

mbabker - comment - 12 Jun 2018

I wouldn't hold my breath on legal questions. I raised one in January, it never made it to the lawyers. And I was on the board at the time.

infograf768 - comment - 13 Jun 2018

@nikosdion
You have an urgent mail

infograf768 - comment - 13 Jun 2018

Tested this and I get weird results using Greek language on multilingual site.
I first had to modify the last Greek pack as its site el-GR.localise.php is corrupted and still uses JString. Just took off the corrupted part which was dealing with transliteration and useless for this test.
I do not get highlighting when not entering the full word and also no suggestions.
It looks like the db is populated.

I have a Greek Article with a text.
Σχόλιο: Στην καθομιλουμένη, πολύ συχνά οι όροι site και online παραμένουν αμετάφραστοι από τα αγγλικά.

I search for καθο
No suggestion when I was expecting καθομιλουμένη as suggestion
This is the result:

EDIT: We should not have got any result (same type of result in French for example when trying to search for an unexisting word)

If I use στην, still no suggestion. I get:

infograf768 - comment - 13 Jun 2018

In English, no suggestions either but when the word exists the full term is at least highlighted:

nikosdion - comment - 13 Jun 2018

@infograf768 Do you know how I can replace the xx-XX.localise.php file on CrowdIn? I don't see any option and using the Translate option, predictably, does not let me do it. I only have Proofreader privileges if that's relevant.

Bakual - comment - 13 Jun 2018

@nikosdion Unfortunately you can't upload the translated file. You need to "translate" it using the webgui.
The localsie.php is a beast in Crowdin since we don't really "translate" that php file. We had to use a workaround using txt files so you can adjust the code itself. But Crowdin does that on a sentence by sentence base. So in the greek case it gets messy with the added methods.

infograf768 - comment - 13 Jun 2018

@nikosdion
One has to use the localise.php from a former release. My guess is that the 3.8.3.4 release contains the correct file.
Then you can't use crowdin to make the pack. Or you do it manually or you can use com_localise, the correct files having been added in their folders before creating the pack.

nikosdion - comment - 13 Jun 2018

I asked the other Greek community moderators for help since they have more experience with the Beast :)

Meanwhile I have found the cause of bug that JM spotted in the Greek stemmer. The special upper/lowercase conversion was completely hinged on the use of ISO-8859-7 and wouldn't work with UTF-8. I am rewriting that code from scratch and also handling several odd cases where we have both diaeresis and accent on the same vowel (e.g. ευφυΐα, Μαΐου, καταπραΰνω) that the original code failed to handle. Now, that was the easy part.

The hard part is that the tokenizer coughs, barfs, rolls over and dies when confronted with Greek punctuation such as Greek double quotes (« and ») or the Greek upper dot (·) which is roughly what semicolon is in English. I am still trying to solve this. Greek is a complicated language :p

Once I am done I will send a PR to Hannes' repo.

As for testing, I would much rather use actual, real world Greek (with all the grammar and spelling mistakes) from news sites such as in.gr and ert.gr. So, in the interest of beating the living ghost out of the Greek stemmer I ended up using these articles:

http://www.aplotaria.gr/dialitika/ An article on the use of diaeresis with the mothers of all exceptions in their use.
http://www.ert.gr/eidiseis/ellada/kinonia/panelladiki-epitropi-ton-mplokon-symmetochi-sta-ergatika-syllalitiria/ A well structured, random article from the Greek state broadcaster with plenty of Greek-specific punctuation.
http://www.in.gr/2018/06/13/politics/vouli/sta-skaria-protasis-dyspistias-kata-tis-kyvernisis-apo-ti-nd/ A random news article with typos abound (Latin letters making it into Greek words, apostrophes instead of accents etc).

2dd6ecc 13 Jun 2018

Greek stemmer fix

Hackwar - comment - 13 Jun 2018

And that is why the stemmer and tokeniser is part of the language class of Finder, so that you can modify the tokeniser for each language. ?

Regarding the license: @nikosdion, which code did you use for the stemmer? This here seems to be the Drupal stemmer and is licensed as GPLv2: https://github.com/magaras/greek_stemmer
I understand the interest in GPLv3, but when push comes to shove, I'd rather find a way to keep this PR GPLv2 in order to get it accepted, instead of keeping it in limbo and hoping for conversion to GPLv3...

Anybody got an idea about the phpcs? I got no working setup here right now and don't want to mess around with a few dozens commits to test this in this PR...

fb32ca0 13 Jun 2018

Greek stemmer fix

nikosdion - comment - 13 Jun 2018

I had used the one from your Gist. Give me an hour, I will port back the GPLv2 stemmer to solve the license issue.

3f13ff3 13 Jun 2018

Greek stemmer fix

nikosdion - comment - 13 Jun 2018

I sent you the PR. Apparently the actual stemmer (the "black magic" that converts words to stems) is common in all versions of the Drupal stemmer plugin. The code I ripped and wrote from scratch was the only difference in the GPLv3 version of the stemmer. Since that positively GPLv3 licensed code is no longer present we can definitely license the Greek stemmer under GPLv2+ and everyone's happy. Amen.

Hackwar - comment - 13 Jun 2018

e1fc4b3 13 Jun 2018

Merge pull request #24 from nikosdion/feature/j4finder_language

infograf768 - comment - 13 Jun 2018

Reminder: I have similar issues with French and Greek in the sense that we do not get suggestions.
Therefore except for the fact that the stems are apparently correctly saved in the db (while a bit strange for French, can't say for Greek), it is rather difficult if not impossible to test in the real world.
Example of stems in French:

Affich sur looks pretty weird to me.

infograf768 - comment - 13 Jun 2018

@nikosdion
FYI: That one is GPL2 apparently and it looks like it deals with utf8/ISO issue. It is a bit different from the Drupal one.
https://github.com/magaras/greek_stemmer/blob/master/mod_stemmer.php

infograf768 - comment - 13 Jun 2018

[13-Jun-2018 18:08:21 Europe/Berlin] PHP Notice: Undefined index: ΙΖΟΥΜΕ in /Applications/MAMP/htdocs/testnew/joomla40/administrator/components/com_finder/helpers/indexer/language/el-GR.php on line 189
[13-Jun-2018 18:08:21 Europe/Berlin] PHP Notice: Undefined index: ΙΖΩ in /Applications/MAMP/htdocs/testnew/joomla40/administrator/components/com_finder/helpers/indexer/language/el-GR.php on line 189
[13-Jun-2018 18:08:21 Europe/Berlin] PHP Notice: Undefined index: ΙΣΤΗ in /Applications/MAMP/htdocs/testnew/joomla40/administrator/components/com_finder/helpers/indexer/language/el-GR.php on line 276
[13-Jun-2018 18:08:21 Europe/Berlin] PHP Notice: Undefined index: ΙΣΤΑ in /Applications/MAMP/htdocs/testnew/joomla40/administrator/components/com_finder/helpers/indexer/language/el-GR.php on line 276

nikosdion - comment - 13 Jun 2018

Suggestions: not my monkey, not my circus. I can only give you the stemmer. Finder is a beast I have kept well away from to save my sanity.

GPLv2: we have already worked that out. See the commits in the PR in Hannes' repository. And no, the one you found -which is the same Hannes told me about a few messages above yours- does NOT deal with UTF-8 vs ISO-8859-7. It simply uses a different way to change the case of tokens after it converts them to ISO-8859-7. But don't worry, I wrote my own code for that which is more forwards compatible and it's GPL2 and shared copyright with OSM since I've signed the Joomla! Contributor Agreement back when Elin was president of the OSM (circa 2009 or 2010-ish?).

Regarding the notice you get: Jean-Marie, you had me read a PhD thesis, analyze some JavaScript code and read up on my Greek grammar, something I last did 20 years ago. Phew. The good news is that I can now understand what is going on! The original stemmer from the 2006 thesis of Georgios Ntais was incomplete in some of the edge cases. It was translated verbatim to PHP in 2009 by a different person. Since then other developers, not connected to the first two, have added to it, apparently without understanding how the original code worked. Therefore they monkey-copied the code to apply the rules without realizing that the $step1list is, in fact, a list of exceptions which is to be applied only to a specific ruleset (the one marked as Step 1, not the ones marked S1 through S9). I will have to fix that code and do another PR to Hannes' repo. Please note that the fix only really addresses the PHP Notice, it does not have any impact on the operation of the stemmer (see below for the technical details).

If you're wondering why the Greek stemmer works as it is, it's because of PHP's wonderful inconsistencies. First of all $myArray[$nonExistentIndex] returns null and issues a PHP Notice instead of throwing an error. Second, $string . null results in $string since null when type-casted to a string is an empty string, therefore the concatenation with null yields the original string. Put together, $string . $myArray[$nonExistendIndex] returns $string and throws a PHP Notice. The Notice is what JM reported. But since its effects are transparent to the functionality of the code I didn't notice any problems when testing.

b2f6758 13 Jun 2018

Fixing the Greek stemmer

nikosdion - comment - 13 Jun 2018

@Hackwar You have a new PR from your now favorite pest of a developer :p

f39b868 13 Jun 2018

Merge pull request #25 from nikosdion/feature/j4finder_language

Hackwar - comment - 13 Jun 2018

You are indeed my favorite developer, since you get stuff done. ? Now we just need to get the code to pass the codestyle and I need to make a bunch of additional changes. After some considerations, I've decided to revert some changes and to work with the result of Locale::getPrimaryLanguage() for the stemmer, tokeniser and the value stored for the token in the DB. That makes stuff a bit easier for everybody...

brianteeman - comment - 13 Jun 2018

@Hackwar can you let us know when you have done those changes and it is ready for testing again

d4d8f3a 13 Jun 2018

Changing language classes back to primary language selection in com_finder

cf4f640 13 Jun 2018

com_finder: simplifying the storage code in the tokeniser cache

3524b00 13 Jun 2018

Adding encoding parameter to phpcs

joomla-cms-bot - change - 13 Jun 2018

Category

Administration com_finder Front End External Library Composer Change Libraries Repository

⇒

Repository Administration com_finder Front End External Library Composer Change

Hackwar - comment - 13 Jun 2018

@brianteeman that would be now.

I've reverted the removal of FinderIndexerHelper::getPrimaryLanguage() and I'm using that language tag now for the terms. I also did some further optimisations by removing all terms that validate to false. That should be the empty string and 0 if I am right. Maybe with PHP, this could also be the case for the string 'false', but I didn't check that, to be honest. (The terms are send through array_filter())

I also added the attributes $language and $spacer to the FinderIndexerLanguage class. $language contains the language string that is used in the terms table and will be used in the stop word table. $spacer is meant to be used by languages that don't use a space for character separation.

There also were a bunch of bugs in the code, some of them actually already in the original code of com_finder. In FinderIndexerHelper::tokenize() I fixed several issues with the spacer in phrases and also made the multi-language implementation actually work for all languages.

The issues with the suggestions are also fixed. They did not work because I removed the FinderIndexerHelper::getPrimaryLanguage() method and the code still asked for it.

Last but not least, I tried fixing the phpcs issues here, but no luck so far. If anybody could have a look at the issues with the line length? phpcs seems to not be able to correctly count utf8 chars here.

Hackwar - comment - 13 Jun 2018

In terms of features and implementation, I would call this PR complete. Additional language packs should be added in separate PRs. So if you guys don't find any bugs, I would be really happy if we could get this one merged, too.

brianteeman - comment - 13 Jun 2018

Great - thanks for the update

Quy - comment - 13 Jun 2018

Last but not least, I tried fixing the phpcs issues here, but no luck so far. If anybody could have a look at the issues with the line length? phpcs seems to not be able to correctly count utf8 chars here.

Pinging @photodude

photodude - comment - 13 Jun 2018

What exactly are these "issues" which are difficult to fix? Most of this seems very straightforward...
Note: I would not be surprised if PHPCS 1.5.x core line length check has issues with UTF8, but by my count most of these lines are still too long as we enforce 150 characters.

(sorry our PHPCS2.x and 3.x ruleset versions are not ready yet; unfortunately, I'm a little swamped with learning Finite Element Analysis and haven't been able to dig into finishing the work there and the code style changes here since PHPCS 2.x+ is much better at catching CS errors.)

FILE: ...cms/administrator/components/com_finder/helpers/indexer/language/el.php
--------------------------------------------------------------------------------
FOUND 2 ERROR(S) AND 16 WARNING(S) AFFECTING 18 LINE(S)
--------------------------------------------------------------------------------
  89 | WARNING | Line exceeds 150 characters; contains 6578 characters
 105 | WARNING | Line exceeds 150 characters; contains 167 characters
 107 | WARNING | Line exceeds 150 characters; contains 200 characters
 144 | WARNING | Line exceeds 150 characters; contains 247 characters
 174 | WARNING | Line exceeds 150 characters; contains 238 characters
 190 | WARNING | Line exceeds 150 characters; contains 225 characters
 263 | WARNING | Line exceeds 150 characters; contains 286 characters
 266 | ERROR   | Please start your comment with a capital letter; found "// for words like ΠΛΟΥΣΙΟΚΟΡΙΤΣΑ, ΠΑΛΙΟΚΟΡΙΤΣΑ etc"
 375 | WARNING | Line exceeds 150 characters; contains 589 characters
 472 | WARNING | Line exceeds 150 characters; contains 371 characters
 512 | WARNING | Line exceeds 150 characters; contains 167 characters
 533 | WARNING | Line exceeds 150 characters; contains 1025 characters
 559 | WARNING | Line exceeds 150 characters; contains 241 characters
 560 | WARNING | Line exceeds 150 characters; contains 200 characters
 669 | WARNING | Line exceeds 150 characters; contains 251 characters
 688 | WARNING | Line exceeds 150 characters; contains 420 characters
 765 | WARNING | Line exceeds 150 characters; contains 866 characters
 886 | ERROR   | Doc comment for var &$w_CASE does not match actual variable name $w_CASE at position 2

infograf768 - comment - 14 Jun 2018

Congratulations!!
A lot of progress.
Here I do now get for Greek some suggestions.

Will let Greek speaking people check if that is OK.

In French I also get some suggestions, but I have an issue when hitting search. Maybe someone can solve that.
This screenshot shows the suggestions AND the phrase it corresponds to.

The problem is that if I choose the suggestion, I do not get any result.

Now, if I enter the full phrase as it is in the article, i.e. "Association de menu", I get:

But this is not constant. Here I do get results:

10d252e 14 Jun 2018

Codestyle

Hackwar - comment - 14 Jun 2018

I fixed the codestyle issues and broken up the long lines into lines of 150 characters length. The modification that I did to the build.xml seems to also be correct, but I don't know how this actually gets active, so that phpcs is called with the encoding parameter. Tested this locally and with that parameter it works fine.

Hackwar - comment - 14 Jun 2018

@infograf768 I don't know how to replicate this. In the english data that I have, this works fine. Can you test this without this PR?

infograf768 - comment - 14 Jun 2018

I am not sure I can really test on 4.0. Will do later.
My comparison will be with 3.x using the French stemmer.
And it works fine there (independant from site language, tested on monolanguage site)

and result

infograf768 - comment - 14 Jun 2018

Tested using French stemmer without your patch.
Also works fine apparently

joomla-cms-bot - change - 14 Jun 2018

Category

Administration com_finder Front End External Library Composer Change Repository

⇒

Unit Tests Repository Administration com_finder Front End External Library Composer Change

Hackwar - change - 16 Jun 2018

Labels

Added: ?

joomla-cms-bot - change - 16 Jun 2018

Category

Administration com_finder Front End External Library Composer Change Repository Unit Tests

⇒

Unit Tests Repository Administration com_finder Front End

Hackwar - comment - 16 Jun 2018

@infograf768 Please test this again. I think I found your issue.

Hackwar - change - 16 Jun 2018

Labels

Removed: ?

infograf768 - comment - 17 Jun 2018

@Hackwar
It does work much better indeed by displaying results when searching for association de menu (see what I got before).
I still get one weird result in French as the term associer is highlighted (as it does, wrongly, in 3.x but not without your PR in 4.0).

Remaining important issue in any language:
If I search for an unexisting term, instead of getting a message telling that the term could not be found, I get this. Imho that is an important bug:

NOTE as a reminder unrelated to this patch but addressed to whoever knows how to:
The search field loses focus after typing 3 letters (Firefox, Safari) and it is a PITA.

infograf768 - comment - 17 Jun 2018

It would be good that other people than I test this stuff in real world. "Approving" is evidently useless...

Hackwar - comment - 17 Jun 2018

@infograf768 That is intended behavior. That is the power of the stemmer.

infograf768 - comment - 17 Jun 2018

@Hackwar
What is the intended behavior? Finding associer (then OK) or the big bug when the term does not exist at all?

tested again with a term that does not exist at all on this test site:

I doubt this the power of the stemmer... ?

Hackwar - comment - 17 Jun 2018

"Associationner" would have been the stemmer, since "Associati" or something similar would have been the stem. But indeed, there is a check missing. I thought the whole check was unnecessary, but we actually need a slightly different check.

Hackwar - comment - 18 Jun 2018

I fixed the searching. Please test again.

Hackwar - comment - 18 Jun 2018

I don't know what the issue with the CI failure is. Does not look like an issue on my end.

carlitorweb - comment - 18 Jun 2018

@Hackwar can you give me an update for test this? is a lot to read!!, and so far I read, seen the translation team is the one mainly need to test this, right?

Hackwar - comment - 18 Jun 2018

How to test

Create a multi-lingual website, enable the Smart Search content plugin.
Create an article with some text. Make sure that the text contains a word in singular (like "extension") and that it is set to a language.
Create another article with some text. Make sure that that text contains a word with the same stem (like "extensions" or "extensively") and that it is set to a language.
Search in the frontend for your first word. See that it only returns that one result.
Apply patch and rebuild the index in the backend.
Search in the frontend for your first word and see that it returns both articles.

It would be best to not use english or french, since those 2 languages were also supported in the past. If you use english or french, you have to know that your results in step 4 depend on the setting for the stemmer in the configuration of the component.

I hope that covers everything?

carlitorweb - comment - 18 Jun 2018

Thank you!

infograf768 - comment - 19 Jun 2018

I fixed the searching. Please test again.

Solved. Now searching for an unexisting term displays correctly.

I confirm that the search for a term existing in many languages only returns a result for the language in use.

Looks fine to me now concerning stemmer.

infograf768 - test_item - 19 Jun 2018 - Tested successfully

infograf768 - comment - 19 Jun 2018

I have tested this item ✅ successfully on 10d252e

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391.}

carlitorweb - comment - 19 Jun 2018

I left something out?. When I rebuild the index, after applied the PR, I got this:

UPDATE: The JPatch Tester, lose some diff. This is a long PR, and this tool is not recommended for this scenario.

Hackwar - comment - 19 Jun 2018

@carlitorweb this sounds like your branch did not check out properly. Or did you recreate/change the composer files?

carlitorweb - comment - 19 Jun 2018

Search in the frontend for your first word and see that it returns both articles

@Hackwar no return both article only returns a result for the language in use. But this is the expected behavior

Tested with es and it

carlitorweb - test_item - 19 Jun 2018 - Tested successfully

carlitorweb - comment - 19 Jun 2018

I have tested this item ✅ successfully on 10d252e

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391.}

infograf768 - comment - 20 Jun 2018

BTW: tested with Serbian sr-YU (after correcting the localise.php because of JString) and it works fine although we do not have any stemmer for it.

Looks like we have some conflicting files now but for me this can go RTC.

infograf768 - change - 20 Jun 2018

Status	Pending	⇒	Ready to Commit
Labels

infograf768 - comment - 20 Jun 2018

RTC after correcting conflicts.

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/20391.}

Hackwar - change - 20 Jun 2018

Labels

Added: ?

Hackwar - comment - 20 Jun 2018

I've updated the composer files and hopefully those are correct now. So unless someone has an issue with the composer stuff, we can indeed merge this now.

joomla-cms-bot - change - 20 Jun 2018

Category

Administration com_finder Front End Repository Unit Tests

⇒

Unit Tests Repository Administration com_finder Language & Strings Front End

wilsonge - close - 20 Jun 2018

wilsonge - merge - 20 Jun 2018

wilsonge - change - 20 Jun 2018

Status	Ready to Commit	⇒	Fixed in Code Base
Closed_Date	0000-00-00 00:00:00	⇒	2018-06-20 10:59:00
Closed_By		⇒	wilsonge
Labels	Added: ?

wilsonge - comment - 20 Jun 2018

Awesome work! Thanks Hannes and to all the people who got involved and tested!

Hackwar - comment - 20 Jun 2018

Thanks for merging!

Add a Comment

Older
Newer

Joomla! Issue Tracker - CMS

[#20391] - [4.0] Using language specific tokeniser and stemmer for com_finder

Tokenisation

Stemming

Providing language specific tokeniser and stemmer

How to test

How to test

Add a Comment