Common words (also known as "stop words") in Smart Search were being assigned too much weight in search queries and were not flagged as being common. The reason turned out to be that the default language "*" was not being recognised as matching the language code used in the common words table ("en"). Thus common words were simply not being recognised as such.
Note that this only affects English because no other language has common words at the present time (unless you added them to the database yourself).
This PR amends the FinderIndexerHelper::isCommon method so that "*" is recognised as shorthand for the default language.
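The change can be pictured roughly as follows. This is an illustrative Python sketch, not the actual PHP in FinderIndexerHelper::isCommon; the names `is_common`, `COMMON_WORDS` and `DEFAULT_LANGUAGE` are hypothetical stand-ins for the method, the common words table and the site's default language.

```python
# Hypothetical stand-in for the common words table, keyed by language code.
COMMON_WORDS = {"en": {"the", "and", "of"}}

# Assumption: the site's default language resolves to "en".
DEFAULT_LANGUAGE = "en"


def is_common(token, lang):
    """Return True if token is listed as a common word for the given language."""
    # "*" is treated as shorthand for the default language, so that it
    # matches the language code stored alongside the common words.
    if lang == "*":
        lang = DEFAULT_LANGUAGE
    return token in COMMON_WORDS.get(lang, set())
```

Without the "*" mapping, the lookup for "*" finds no word list at all, so nothing is ever flagged as common. With it, `is_common("the", "*")` is true.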
It's difficult to construct a search query where this change makes a significant difference to the outcome. In fact, I've given up trying! The easiest way to test is to look in the #__finder_terms table and check that the word "the" has dropped from a weight of 0.2 to 0.025 and has a 1 in the "common" column. Prior to applying this PR, no terms are flagged as common.
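The two weights quoted above are consistent with a simple length-based weighting. The sketch below is an assumption made for illustration, not a copy of the indexer code: it supposes that a term's weight is its length (capped at 15 characters) divided by 15, and that a common word's weight is then divided by 8. The function name `term_weight` is hypothetical.

```python
def term_weight(term, common):
    """Illustrative weight for a term under the assumptions stated above."""
    # Assumed base weight: length capped at 15 characters, scaled to 0..1.
    weight = min(len(term), 15) / 15
    # Assumed penalty: a common word keeps one eighth of its base weight.
    if common:
        weight /= 8
    return weight
```

Under these assumptions, "the" weighs 3/15 = 0.2 when it is not flagged as common and 0.2/8 = 0.025 when it is.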
Note that you will need to purge and re-index after applying this PR. Re-indexing without purging will not force the weights to be recalculated.
Fixing this bug is really about correctly labelling common words so as to pave the way for more sophisticated ranking algorithms in the future.
None. This is a bug fix.
Status | New | ⇒ | Pending |
Category | ⇒ | Administration Components |
Category | Administration Components | ⇒ | Administration Components Search |
Yes, purge = clear index. It's still --purge on the cli.
For anything other than testing this PR re-indexing isn't important. As I noted in the summary, I very much doubt anyone will notice the difference anyway. I wouldn't want people to think they have to re-index, but the next time they do, they'll get the new weights.
I know it's OT, but how is the table #__finder_terms_common populated?
AFAIK only on a new installation (see the com_finder installation file), but I think this should be part of the language XML manifest.
There is also no user interface to manage the list.
@piotr-cz You are correct and I would like to change that.
There needs to be a mechanism for including a common words table in language packs. We also need a mechanism for overriding entries for a specific website so that it can be tuned for the particular statistical distribution of words found on that site. And we need a more sophisticated mechanism for site administrators to influence the ranking calculations. Do we even need a common words database table? We could just load the common words into memory from a JSON file in the language pack as needed, then load an override file from another location to override/customise it. There are many possibilities and your suggestions are welcome.
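As a rough illustration of the JSON idea, the sketch below loads a base list and then applies a site-specific override. Everything in it is hypothetical: the function name `load_common_words`, the `"common"` key in the language-pack file and the `"add"`/`"remove"` keys in the override are assumptions, since no file format has been agreed. JSON strings are used in place of files to keep the example self-contained.

```python
import json


def load_common_words(base_json, override_json=None):
    """Build the set of common words from a base list plus an optional override."""
    # Base list, as it might ship in a language pack (hypothetical format).
    words = set(json.loads(base_json)["common"])
    if override_json is not None:
        override = json.loads(override_json)
        # Site-specific additions and removals (hypothetical format).
        words |= set(override.get("add", []))
        words -= set(override.get("remove", []))
    return words


base = '{"common": ["the", "and", "of"]}'
override = '{"add": ["joomla"], "remove": ["of"]}'
site_words = load_common_words(base, override)
```

Here `site_words` contains "the", "and" and "joomla", but no longer "of". A site could tune the list this way without touching the language pack itself.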
But, one step at a time. :-)
I have tested this item
I have tested this item
database changes observed
Easy | No | ⇒ | Yes |
Category | Administration Components Search | ⇒ | Administration com_finder Components Search |
I have tested this item
Tested on Joomla! 3.7.0-alpha1. Looked up the German word "die" (similar to "the"); its weight was 0.2 both before and after the patch.
@franz-wohlkoenig I'm afraid that this works only for English language at the moment, see #12450 (comment)
Easy | Yes | ⇒ | No |
@franz-wohlkoenig could you reset your test result please
I have not tested this item.
@brianteeman reset to "not tested". Will test using the English language.
Please mark RTC as it has two good tests
Status | Pending | ⇒ | Ready to Commit |
RTC
Status | Ready to Commit | ⇒ | Fixed in Code Base |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2018-01-08 17:14:25 |
Closed_By | ⇒ | mbabker | |
I assume "purge" means using the "Clear Index" button?
Will this not need to be documented in the upgrade notes, as people don't usually "clear the index", do they? (I don't use Smart Search.)
Brian Teeman
Co-founder Joomla! and OpenSourceMatters Inc.
https://brian.teeman.net/