Joomla! Issue Tracker | Joomla! CMS #12664 - [4.0] Refactoring com_search & com

Closed
17 Apr 2018
Medium
Build: 4.0-dev
# 12664
Diff
Hackwar:j4search3

Success: continuous-integration/drone, the build was successful

Hackwar
31 Oct 2016

Joomla has had two search components since Joomla 1.6: com_search and com_finder. Both have several issues, and both lack features and usability. This PR is an attempt to largely rewrite both, along with their accompanying plugins and code. This is an ongoing development effort, submitted as a work in progress so that others can comment and help steer the direction.

Planned changes

Remove the partitioning tables
Move backend com_finder into com_search and change frontend com_finder to a search plugin for com_search
Introduce result weighting and an overhaul of the search result data structure for com_search search plugins
Improve indexing and searching in smart search by using proper stemmers, configurable tuple length, etc. In this step, both the stemmer and the tokenizer have to move into the language packs, thus allowing support for Chinese, etc. The search process itself also needs a review, since I think there are some flaws in there...
Understand the filter and content map feature of Finder and see how this could be improved and maybe made more understandable.
Code cleanup.

I will update this list and mark done parts accordingly

Remove the partitioning tables

Finder currently has 16 tables to store the mapping between term and content. This was done for performance reasons: these tables can become really big really fast, and for large sites it would be better to have several smaller tables, or at least that was the theory. That theory might have been correct in 2008/2009, when Finder was written (I don't know enough about this to judge that), but it is no longer correct in 2016.

Finder basically implemented database table partitioning in PHP, which I would call a bad idea. MySQL has supported partitioning since 5.6. I would argue that we should drop this PHP implementation for several reasons:

The performance gain from this partitioning only really has an effect on large sites. Those sites, however, should implement real partitioning in the database rather than rely on this.
For smaller sites it actually has a detrimental effect.
It is a lot of code and additional tables that confuse people, including me.

To me, it at least feels as if indexing the sample data is quicker with these changes than before. Searching should also be a bit quicker. This will require some updates to the documentation, since there are several places where we talk about the 16 mapping tables.
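
To make concrete what "partitioning in PHP" means, here is a minimal Python sketch. It is not Finder's actual code: it assumes the mapping table is chosen from one hexadecimal digit of a hash of the term, and the table base name and both helper functions are hypothetical.

```python
import hashlib

# Hypothetical base name, used only for illustration.
BASE_TABLE = "finder_links_terms"


def table_for_term(term: str) -> str:
    """Pick one of 16 mapping tables from the first hex digit of the term's MD5."""
    suffix = hashlib.md5(term.encode("utf-8")).hexdigest()[0]
    return f"{BASE_TABLE}{suffix}"


def tables_to_query(terms: list[str]) -> set[str]:
    """With application-level partitioning, every search first has to work out
    which of the 16 tables to touch. With one table this step disappears."""
    return {table_for_term(term) for term in terms}
```

With a single mapping table, or with partitioning declared in the database itself, the application code no longer needs helpers like these, and the query planner decides which partition to read.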

Move backend com_finder into com_search and change frontend com_finder to a search plugin for com_search

The main goal is to merge the two components. com_search is needed because several extensions out there only provide a search plugin for com_search. com_finder, on the other hand, provides the indexed search. The solution will be to move the necessary code from com_finder into com_search and turn the search code of com_finder (basically its search model) into a search plugin. I would like to research the possibilities of using new plugin events and providing interfaces for a search plugin to implement, both to merge the "search" and "finder" plugins into one plugin and to provide something that developers can work with. Old search plugins could still be used through a legacy wrapper plugin, like the one we have for the legacy component routers.
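
The wrapper idea can be sketched as a plain adapter. This is only an illustration in Python, not a proposed Joomla API: the `SearchPlugin` interface, the `on_content_search` method of the stand-in legacy plugin and the result fields are all hypothetical.

```python
from typing import Protocol


class SearchPlugin(Protocol):
    """Hypothetical interface that a new-style search plugin would implement."""

    def search(self, query: str) -> list[dict]: ...


class LegacySearchPlugin:
    """Stand-in for an old plugin that only answers a legacy, event-style call."""

    def on_content_search(self, text: str) -> list[tuple[str, str]]:
        return [("Legacy hit for " + text, "index.php?id=1")]


class LegacyPluginWrapper:
    """Adapter: exposes a legacy plugin through the new interface."""

    def __init__(self, legacy: LegacySearchPlugin) -> None:
        self._legacy = legacy

    def search(self, query: str) -> list[dict]:
        # Translate the old result shape into the new one.
        return [
            {"title": title, "url": url}
            for title, url in self._legacy.on_content_search(query)
        ]


def run_search(plugins: list[SearchPlugin], query: str) -> list[dict]:
    """The component only ever talks to the new interface."""
    results: list[dict] = []
    for plugin in plugins:
        results.extend(plugin.search(query))
    return results
```

The component would then treat wrapped legacy plugins and new plugins the same way, for example `run_search([LegacyPluginWrapper(LegacySearchPlugin())], "joomla")`.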

Introduce result weighting and an overhaul of the search result data structure for com_search search plugins

Right now, com_search does not do any weighting of the search results. Results simply come out in the order the search plugins push them in, which could be the order in which they were added to the database or completely random. In most cases, however, we want search results that are ordered by relevance. For that, each result should have a weight that describes how relevant it is and by which we can sort the whole result set. To prevent developers from gaming this by simply setting the weight to some ridiculous number, so that the results from their plugin are always at the top, we limit the weights to a value between 0 and 1, using a default of 0.5 when no weight is given.
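
As a rough illustration of the weighting rule described above, here is a minimal Python sketch. The `SearchResult` class and the function names are hypothetical, and clamping out-of-range weights (rather than rejecting them) is an assumption about how the limit might be enforced.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_WEIGHT = 0.5  # used when a plugin supplies no weight


@dataclass
class SearchResult:
    title: str
    weight: Optional[float] = None  # None means "no weight given"


def normalised_weight(result: SearchResult) -> float:
    """Return the weight limited to [0, 1], or the default if none was given."""
    if result.weight is None:
        return DEFAULT_WEIGHT
    return min(1.0, max(0.0, result.weight))


def sort_by_relevance(results: list[SearchResult]) -> list[SearchResult]:
    """Most relevant first; ties keep the order the plugins delivered."""
    return sorted(results, key=normalised_weight, reverse=True)


results = [
    SearchResult("A", 9999),  # a plugin trying to force itself to the top
    SearchResult("B"),        # no weight given, falls back to 0.5
    SearchResult("C", 0.9),
    SearchResult("D", -3),
]
ordered = [r.title for r in sort_by_relevance(results)]
```

Here the inflated weight of "A" is capped at 1, so it can still rank first but cannot outrank every other plugin by an arbitrary margin.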

Improve indexing and searching in smart search

Instead of splitting words by spaces (and thus preventing languages like Chinese from being indexed), we will have to provide a better, language-specific tokenizer. Instead of hardcoding a few stemmers into the core codebase, we will "push" this to the translation teams. They will have to provide a stemmer in their language pack to stem the words properly. I will try to provide a few stemmers where I can get hold of them. Instead of indexing tuples of 1-3 words, we are going to index just single words by default and longer tuples only on request. To be honest, I would remove multiple-word indexes altogether. Searching should also be improved to return results not only for consecutive words (= 3-word tuples), but for words found anywhere in the content. Searching for "Beginner Joomla" in the sample data should return the "Beginners" article...
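
The following Python sketch shows how these pieces could fit together. It is a toy, not the Finder indexer: the stemmer only strips a plural "s", the tokenizer still splits on whitespace (exactly the behaviour that fails for Chinese), and all function names and the sample text are made up for illustration.

```python
from typing import Callable, Iterator

Stemmer = Callable[[str], str]
PUNCTUATION = ".,:;!?\"'()"


def naive_english_stemmer(word: str) -> str:
    """Toy stemmer: lowercase and drop a trailing plural 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def tokenize(text: str) -> list[str]:
    """Whitespace tokenizer; a language pack would replace this."""
    words = (w.strip(PUNCTUATION) for w in text.split())
    return [w for w in words if w]


def tuples(tokens: list[str], max_len: int = 1) -> Iterator[str]:
    """Yield every run of 1..max_len consecutive tokens."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_index(docs: dict[str, str], stem: Stemmer,
                max_len: int = 1) -> dict[str, set[str]]:
    """Map each stemmed term (or tuple of terms) to the documents containing it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        stems = [stem(t) for t in tokenize(text)]
        for term in tuples(stems, max_len):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str, stem: Stemmer) -> set[str]:
    """Return documents containing every query word, anywhere in the text."""
    hits = [index.get(stem(t), set()) for t in tokenize(query)]
    return set.intersection(*hits) if hits else set()


docs = {
    "beginners": "Joomla for Beginners: getting started",
    "other": "Advanced templates",
}
index = build_index(docs, naive_english_stemmer)  # single words by default
found = search(index, "Beginner Joomla", naive_english_stemmer)
```

Because both the indexed text and the query go through the same stemmer, and matching is done per word rather than per consecutive tuple, "Beginner Joomla" finds the "beginners" document even though the words appear in a different order and form.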

Improve filter and content map feature of Finder

I personally don't understand that feature yet, and I think it is something that a lot of people have issues with. Maybe we can find some way to make it usable.

Code cleanup

The code is from somewhere around 2009, and while it is of high quality, some of it has been superseded by changes in our framework. Also, with all the above changes, some cleanup will be necessary.

Notes

This PR is based on #12592. Rather than writing a dozen PRs back and forth, I opted for creating just one big one instead.
I invite you to help me make this happen. Feel free to comment, or write code yourself and send a PR against my branch to get this done.

Votes

# of Users Experiencing Issue

1/1

Average Importance Score

5.00

8bad953 27 Oct 2016

Finder: Removing laymans table partitioning

0fec7ac 27 Oct 2016

Codestyle

b4f9937 27 Oct 2016

Merge branch '4.0-dev' of https://github.com/joomla/joomla-cms into j4search2

Hackwar - open - 31 Oct 2016

Hackwar - change - 31 Oct 2016

Status

New

⇒

Pending

joomla-cms-bot - change - 31 Oct 2016

Labels

Added: ?

joomla-cms-bot - change - 31 Oct 2016

Category

Administration Components

⇒

Administration com_finder com_search Components

d1ee979 12 Nov 2016

Merge branch '4.0-dev' of https://github.com/joomla/joomla-cms into j4search3

mbabker - comment - 12 Nov 2016

Merging these two components just doesn't feel right. Ya, it's an interesting situation shipping two search components as part of core, but the two components take such vastly different approaches to the concept of search that I don't think you can sanely manage both of those through a single component and use a plugin to "bridge" the more advanced and extremely technical approach into the "simple" component. The other changes look promising and should be continued on, it's just this one aspect that I'm not quite comfortable with at the moment.

7b94f36 15 Nov 2016

Implementing language specific stemmers and tokenizers

04136ff 24 Nov 2016

Further refactoring of com_search & com_finder

brianteeman - edited - 12 Mar 2017

brianteeman - change - 20 May 2017

Title

…

~~Joomla 4.0:~~ Refactoring com_search & com_finder

[4.0] Refactoring com_search & com_finder

brianteeman - comment - 20 May 2017

@Hackwar could you resolve the merge conflicts please

Hackwar - comment - 21 May 2017

I will need a lot more time to work on this. I'm not sure if we should close this for now or want to keep it open as a reminder... Fixing the merge conflicts right now does not really help, since this is a work in progress. (Yes, I know, half a year without any action is a long time.)

brianteeman - change - 8 Jun 2017

Milestone

Added:

humblehumanbeing - comment - 26 Jun 2017

As an end user, I stopped using Finder on my multilingual site of 30K+ articles for the following reasons:
1- The Finder indexer in the backend would start and do some work, but in most cases it was unable to index the whole site because of PHP/MySQL restrictions. I needed to run the indexer many times, using the backend or the CLI indexer.
2- The Finder tables totalled a few gigabytes once indexing had finally finished. It tries to index every portion of a sentence and thus indexes unnecessary search terms. It certainly needs an extensive list of terms in every language that will be excluded from the Finder tables.
3- Finder was unable to index some multi-byte languages such as Chinese.

onderzoekspraktijk - comment - 3 Dec 2017

In a Joomla 3.8.2 site with fewer than 300 articles, the smart search indexer runs for a few minutes and stops: The table '#__finder_tokens' is full. Re-indexing does not add articles that were not indexed before. It stops at the exact location of the previous run.
This error message keeps appearing more quickly after every update.
Joomla once had professional search, I am sorry to see it deteriorate like this.

_{This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12664.}

ggppdk - comment - 4 Dec 2017

"In a joomla 3.8.2 site with less than 300 articles the smart search indexer runs for a few minutes"

The speed of the indexer (smart search) is mostly affected by

content plugin triggering
db updating

"It stops at the exact location of the previous run."

Triggering content plugins is considered "necessary" to create the real text for indexing
(this is the general case; many websites simply do not really need it),

but it can

severely affect speed, depending on which (core or 3rd party) content/system plugins have code for the 'Content Preparing' event,
and it can cause errors. If you check the server response in the network tab of your browser console, you will probably find the plugin causing this issue. You might also want to set error reporting to maximum and enable debug mode.

I think triggering content plugins should be configurable, since many websites do not need to add text that depends on plugins.

Or in J4.0 we could require that plugins have a different method for the search indexer?
e.g. onContentPrepareSearchText

@onderzoekspraktijk
about debugging your issue, better post in forums, not in this PR

onderzoekspraktijk - comment - 4 Dec 2017

@ggppdk:
Thanks for responding to my cri de coeur. Smart search in my opinion is beyond debugging, it is very unstable where once it was very robust. I have been told it is very complicated software that really nobody wants to rewrite. I fear to use it now, and I find this is a great loss to Joomla.

Hackwar - comment - 5 Dec 2017

I'm still trying to find the time to work on this for Joomla 4.0. Wish me luck

onderzoekspraktijk - comment - 5 Dec 2017

@Hackwar:

I do wish you luck, and time, and if it would help I also wish you a great group of sponsors!

Hackwar - comment - 17 Apr 2018

This is being made obsolete by, among other things, #20185.

Hackwar - close - 17 Apr 2018

Hackwar - change - 17 Apr 2018

Status	Pending	⇒	Closed
Closed_Date	0000-00-00 00:00:00	⇒	2018-04-17 11:44:07
Closed_By		⇒	Hackwar
