? Failure

User tests: Successful: Unsuccessful:

avatar chrisdavenport
chrisdavenport
17 Oct 2016

Common words (also known as "stop words") in Smart Search were being assigned too much weight in search queries and were not flagged as being common. The reason turned out to be that the default language "*" was not being recognised as matching the language code used in the common words table ("en"). Thus common words were simply not being recognised as such.

Note that this only affects English because no other language has common words at the present time (unless you added them to the database yourself).

Summary of Changes

This PR amends the FinderIndexerHelper::isCommon method so that "*" is recognised as shorthand for the default language.

Testing Instructions

It's difficult to construct a search query where it makes a significant difference to the outcome. In fact, I've given up trying! The easiest way is to simply look in the #__finder_terms table and notice that the word "the" has dropped from a weight of 0.2 to 0.025 and it has a 1 in the "common" column. Prior to applying this PR there wouldn't be any terms flagged as common.

Note that you will need to purge and re-index after applying this PR. Re-indexing without purging will not force the weights to be recalculated.

Fixing this bug is really about correctly labelling common words so as to pave the way for more sophisticated ranking algorithms in the future.

Documentation Changes Required

None. This is a bug fix.

avatar chrisdavenport chrisdavenport - open - 17 Oct 2016
avatar chrisdavenport chrisdavenport - change - 17 Oct 2016
Status New Pending
avatar joomla-cms-bot joomla-cms-bot - change - 17 Oct 2016
Labels Added: ?
avatar joomla-cms-bot joomla-cms-bot - change - 17 Oct 2016
Category Administration Components
avatar brianteeman
brianteeman - comment - 17 Oct 2016

Note that you will need to purge and re-index after applying this PR.
Re-indexing without purging will not force the weights to be recalculated.

I assume "purge" means using the "Clear Index" button?

Will this not need to be documented in the upgrade notes as people dont
usually "clear the index" do they (I dont use Smart Search)?

On 17 October 2016 at 23:26, Chris Davenport notifications@github.com
wrote:

Common words (also known as "stop words") in Smart Search were being
assigned too much weight in search queries and were not flagged as being
common. The reason turned out to be that the default language "*" was not
being recognised as matching the language code used in the common words
table ("en"). Thus common words were simply not being recognised as such.

Note that this only affects English because no other language has common
words at the present time (unless you added them to the database yourself).
Summary of Changes

This PR amends the FinderIndexerHelper::isCommon method so that "*" is
recognised as shorthand for the default language.
Testing Instructions

It's difficult to construct a search query where it makes a significant
difference to the outcome. In fact, I've given up trying! The easiest way
is to simply look in the #__finder_terms table and notice that the word
"the" has dropped from a weight of 0.2 to 0.025 and it has a 1 in the
"common" column. Prior to applying this PR there wouldn't be any terms
flagged as common.

Note that you will need to purge and re-index after applying this PR.
Re-indexing without purging will not force the weights to be recalculated.

Fixing this bug is really about correctly labelling common words so as to
pave the way for more sophisticated ranking algorithms in the future.
Documentation Changes Required

None. This is a bug fix.

You can view, comment on, or merge this pull request online at:

#12450
Commit Summary

  • Smart Search common words assigned too much weight

File Changes

Patch Links:


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#12450, or mute the thread
https://github.com/notifications/unsubscribe-auth/ABPH8QW-vp97S_t17Vnw8Vel45Q0FDlzks5q0_YhgaJpZM4KZLYS
.

Brian Teeman
Co-founder Joomla! and OpenSourceMatters Inc.
https://brian.teeman.net/ http://brian.teeman.net/

avatar brianteeman brianteeman - change - 17 Oct 2016
Category Administration Components Administration Components Search
avatar chrisdavenport
chrisdavenport - comment - 18 Oct 2016

Yes, purge = clear index. It's still --purge on the cli.

For anything other than testing this PR re-indexing isn't important. As I noted in the summary, I very much doubt anyone will notice the difference anyway. I wouldn't want people to think they have to re-index, but the next time they do, they'll get the new weights.

avatar piotr-cz
piotr-cz - comment - 18 Oct 2016

I know it's OT, but how is the table #__finder_terms_common populated?
AFAIK only on new installation (see com_finder installation file), but I think this should be part of language xml manifest.
There is also no user interface to manage the list.

avatar chrisdavenport
chrisdavenport - comment - 18 Oct 2016

@piotr-cz You are correct and I would like to change that.

There needs to be a mechanism for including a common words table in language packs. We also need a mechanism for overriding entries for a specific website so that it can be tuned for the particular statistical distribution of words found on that site. And we need a more sophisticated mechanism for site administrators to influence the ranking calculations. Do we even need a common words database table? We could just load the common words into memory from a JSON file in the language pack as needed, then load an override file from another location to override/customise it. There are many possibilities and your suggestions are welcome.

But, one step at a time. :-)

avatar alikon
alikon - comment - 23 Oct 2016

before patch
bfp
after patch
afp

avatar alikon alikon - test_item - 23 Oct 2016 - Tested successfully
avatar alikon
alikon - comment - 23 Oct 2016

I have tested this item successfully on 590939b


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12450.

avatar brianteeman brianteeman - test_item - 14 Nov 2016 - Tested successfully
avatar brianteeman
brianteeman - comment - 14 Nov 2016

I have tested this item successfully on 98ef480

database changes observed


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12450.

avatar brianteeman brianteeman - change - 4 Dec 2016
The description was changed
Easy No Yes
avatar brianteeman brianteeman - change - 4 Dec 2016
Labels Removed: ?
avatar brianteeman brianteeman - edited - 4 Dec 2016
avatar joomla-cms-bot joomla-cms-bot - change - 4 Dec 2016
Category Administration Components Search Administration com_finder Components Search
avatar franz-wohlkoenig
franz-wohlkoenig - comment - 3 Jan 2017

I have tested this item ? unsuccessfully on 98ef480

Test on Joomla! 3.7.0-alpha1. Looked for german-lang "die" (similar to "the"), before and after Patch weight: 0.2.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12450.

avatar franz-wohlkoenig franz-wohlkoenig - test_item - 3 Jan 2017 - Tested unsuccessfully
avatar piotr-cz
piotr-cz - comment - 3 Jan 2017

@franz-wohlkoenig I'm afraid that this works only for English language at the moment, see #12450 (comment)

avatar franz-wohlkoenig
franz-wohlkoenig - comment - 3 Jan 2017

Thanks for information @piotr-cz

avatar franz-wohlkoenig franz-wohlkoenig - change - 8 Apr 2017
Easy Yes No
avatar brianteeman
brianteeman - comment - 19 May 2017

@franz-wohlkoenig could you reset your test result please


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12450.

avatar franz-wohlkoenig
franz-wohlkoenig - comment - 20 May 2017

I have not tested this item.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12450.

avatar franz-wohlkoenig franz-wohlkoenig - test_item - 20 May 2017 - Not tested
avatar franz-wohlkoenig
franz-wohlkoenig - comment - 20 May 2017

@brianteeman reset on "not tested". Will test using English lang.

avatar franz-wohlkoenig
franz-wohlkoenig - comment - 20 May 2017

couldn't test cause Issue

avatar brianteeman
brianteeman - comment - 4 Jan 2018

Please mark RTC as it has two good tests

avatar Quy Quy - change - 4 Jan 2018
Status Pending Ready to Commit
avatar Quy
Quy - comment - 4 Jan 2018

RTC

avatar mbabker mbabker - change - 8 Jan 2018
Status Ready to Commit Fixed in Code Base
Closed_Date 0000-00-00 00:00:00 2018-01-08 17:14:25
Closed_By mbabker
Labels Added: ?
avatar mbabker mbabker - close - 8 Jan 2018
avatar mbabker mbabker - merge - 8 Jan 2018

Add a Comment

Login with GitHub to post a comment