Hackwar
31 Oct 2016

Joomla has had two search components since Joomla 1.6: com_search and com_finder. Both have several issues, and both lack features and usability. This PR is an attempt to largely rewrite both, along with their accompanying plugins and code. This is an ongoing development effort, submitted as a work in progress so that others can comment and help steer the direction.

Planned changes

  • Remove the partitioning tables
  • Move backend com_finder into com_search and change frontend com_finder to a search plugin for com_search
  • Introduce result weighting and an overhaul of the search result data structure for com_search search plugins
  • Improve indexing and searching in smart search by using proper stemmers, configurable tuple length, etc. In this step, both the stemmer and the tokenizer have to move into the language packs, which allows support for Chinese and similar languages. Also review the search process itself, since I think there are some flaws in there...
  • Understand the filter and content map feature of Finder and see how this could be improved and maybe made more understandable.
  • Code cleanup.

I will update this list and mark completed parts accordingly.

Remove the partitioning tables

Finder currently has 16 tables to store the mapping between terms and content. This was done for performance reasons: these tables can become really big really fast, and for large sites it would be better to have several smaller tables, or at least that was the theory. That theory might have been correct in 2008/2009, when Finder was written (I don't know enough about this to judge that), but it is no longer correct in 2016.

Finder basically implemented database table partitioning in PHP, which I would call a bad idea. MySQL has supported partitioning natively since version 5.1. I would argue that we should drop this PHP implementation for several reasons:

  1. The performance gain from this partitioning only really has an effect on large sites. Those sites, however, should implement real partitioning in the database rather than use this.
  2. For smaller sites it actually has a detrimental effect.
  3. It is a lot of code and additional tables that confuse people, including me.
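
To make "partitioning in application code" concrete, here is a minimal Python sketch of the general idea. It is not Finder's actual routing code: the table prefix `mapping_terms_` and both function names are made up for illustration, and routing by the first hex digit of an MD5 hash is only one plausible way to spread terms over 16 tables.

```python
import hashlib
from typing import List, Set

NUM_SHARDS = 16  # the 16 term/content mapping tables mentioned above


def shard_table(term: str) -> str:
    """Return the (hypothetical) mapping table a term is routed to."""
    # The first hex digit of a hash yields one of 16 buckets: '0'..'f'.
    suffix = hashlib.md5(term.encode("utf-8")).hexdigest()[0]
    return "mapping_terms_" + suffix


def tables_touched(terms: List[str]) -> Set[str]:
    """Tables a multi-term query has to visit when the split lives in code."""
    return {shard_table(term) for term in terms}


if __name__ == "__main__":
    query_terms = ["beginner", "joomla", "search"]
    for term in query_terms:
        print(term, "->", shard_table(term))
    print("tables to query:", sorted(tables_touched(query_terms)))
```

Every read and write has to go through this routing step, and a query with several terms may have to visit several tables. With a single mapping table (optionally partitioned by the database itself) that logic disappears from the application.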

With these changes, indexing the sample data at least feels quicker to me than before. Searching should also be a bit quicker. This will require some updates to the documentation, since there are several places where we talk about the 16 mapping tables.

Move backend com_finder into com_search and change frontend com_finder to a search plugin for com_search

The main goal is to merge the two components. com_search is needed because there are several extensions out there that only provide a search plugin for com_search. com_finder, on the other hand, provides the indexed search. The solution will be to move the necessary code from com_finder into com_search and turn the search code of com_finder (basically its search model) into a search plugin. I would like to research the possibility of using new plugin events and providing interfaces for a search plugin to implement, both to merge the "search" and "finder" plugins into one plugin and to provide something that developers can work with. Old search plugins could still be used through a legacy wrapper plugin, like the one we have for the legacy component routers.
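
The interface-plus-legacy-wrapper idea could look roughly like the following Python sketch. All class and function names here (`SearchPlugin`, `IndexedSearchPlugin`, `LegacyPluginWrapper`, `run_search`) are hypothetical and only illustrate the shape of the design; none of them exist in Joomla.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List


class SearchPlugin(ABC):
    """Hypothetical interface every new-style search plugin would implement."""

    @abstractmethod
    def search(self, query: str) -> List[Dict[str, str]]:
        ...


class IndexedSearchPlugin(SearchPlugin):
    """Stands in for the former com_finder search model, backed by an index."""

    def __init__(self, index: Dict[str, List[str]]) -> None:
        self.index = index

    def search(self, query: str) -> List[Dict[str, str]]:
        titles = self.index.get(query.lower(), [])
        return [{"title": title, "source": "indexed"} for title in titles]


class LegacyPluginWrapper(SearchPlugin):
    """Adapts an old-style search callback to the new interface."""

    def __init__(self, legacy_callback: Callable[[str], List[str]]) -> None:
        self.legacy_callback = legacy_callback

    def search(self, query: str) -> List[Dict[str, str]]:
        return [{"title": title, "source": "legacy"}
                for title in self.legacy_callback(query)]


def run_search(plugins: List[SearchPlugin], query: str) -> List[Dict[str, str]]:
    """The component only talks to the interface, never to a concrete plugin."""
    results: List[Dict[str, str]] = []
    for plugin in plugins:
        results.extend(plugin.search(query))
    return results
```

The component code then stays the same whether a result comes from the indexed search or from a wrapped third-party plugin that was never updated.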

Introduce result weighting and an overhaul of the search result data structure for com_search search plugins

Right now, com_search does not do any weighting of the search results. Results simply come out in the order the search plugins push them in, which could be the order in which they were added to the database or completely random. In most cases, however, we want search results that are ordered by relevance. For that, each result should have a weight that describes how relevant it is and by which we can sort the whole result set. To prevent developers from gaming this by setting the weight to some ridiculous number so that the results from their plugin are always at the top, we limit the weights to a value between 0 and 1, using a default of 0.5 when no weight is given.
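
A minimal Python sketch of that rule follows. The `SearchResult` class and the function names are assumptions made for illustration; only the 0 to 1 range and the 0.5 default come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_WEIGHT = 0.5  # used when a plugin supplies no weight


@dataclass
class SearchResult:
    title: str
    weight: Optional[float] = None  # None means "no weight given"


def normalized_weight(result: SearchResult) -> float:
    """Clamp a plugin-supplied weight into the range 0..1."""
    if result.weight is None:
        return DEFAULT_WEIGHT
    return min(1.0, max(0.0, result.weight))


def rank(results: List[SearchResult]) -> List[SearchResult]:
    """Order results by relevance, most relevant first."""
    return sorted(results, key=normalized_weight, reverse=True)


if __name__ == "__main__":
    results = [
        SearchResult("Greedy plugin", 9999),  # clamped to 1.0
        SearchResult("No weight"),            # defaults to 0.5
        SearchResult("Good match", 0.9),
        SearchResult("Negative", -3),         # clamped to 0.0
    ]
    for result in rank(results):
        print(result.title, normalized_weight(result))
```

Clamping instead of rejecting out-of-range values keeps a misbehaving plugin from breaking the result list while still denying it a guaranteed top spot over other results that also claim a weight of 1.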

Improve indexing and searching in smart search

Instead of splitting words by spaces (and thus preventing languages like Chinese from being indexed), we will have to provide a better, language-specific tokenizer. Instead of hardcoding a few stemmers into the core codebase, we will "push" this to the translation teams. They will have to provide a stemmer in their language pack to stem the words properly. I will try to provide a few stemmers where I can get hold of them. Instead of indexing tuples of 1-3 words, we are going to index just single words by default and longer tuples only on request. To be honest, I would remove multi-word indexes altogether. Searching should also be improved to return results not only for consecutive words (i.e. 3-word tuples), but for words spread all over the content. Searching for "Beginner Joomla" in the sample data should return the "Beginners" article...
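
The "Beginner Joomla" case can be illustrated with a small Python sketch of a single-word index. This is not how Finder works today; `toy_stem` is a deliberately naive placeholder (it only strips a plural "s") standing in for a real per-language stemmer, and the regex tokenizer only suits space-delimited languages, which is exactly the limitation a language-pack tokenizer would lift.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def toy_stem(word: str) -> str:
    """Placeholder stemmer: lowercase and strip a plural 's'. Not a real stemmer."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def tokenize(text: str) -> List[str]:
    """Naive tokenizer; unsuitable for languages written without spaces."""
    return re.findall(r"\w+", text)


def build_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each stemmed single word to the documents that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[toy_stem(token)].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return documents containing every query word, anywhere in the text."""
    stems = [toy_stem(token) for token in tokenize(query)]
    if not stems:
        return set()
    hits = [index.get(stem, set()) for stem in stems]
    return set.intersection(*hits)


if __name__ == "__main__":
    docs = {
        "beginners": "Beginners: if this is your first Joomla site, start here.",
        "upgraders": "Upgrading an existing Joomla site.",
    }
    print(search(build_index(docs), "Beginner Joomla"))
```

Because every word is indexed on its own and matched after stemming, the two query words do not need to appear next to each other, and "Beginner" still finds "Beginners".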

Improve filter and content map feature of Finder

I personally don't understand that feature yet, and I think it is something that a lot of people have issues with. Maybe we can find some way to make this usable.

Code cleanup

The code is from somewhere around 2009, and while it is of high quality, some of it has been superseded by changes in our framework. Also, with all the above changes, some cleanup will be necessary.

Notes

This PR is based on #12592. Rather than writing a dozen separate PRs, I opted for creating just one big one.
I invite you to help me make this happen. Feel free to comment, or write code yourself and send a PR against my branch to get this done.

Hackwar - open - 31 Oct 2016
Hackwar - change - 31 Oct 2016
Status: New -> Pending
joomla-cms-bot - change - 31 Oct 2016
Labels Added: ?
joomla-cms-bot - change - 31 Oct 2016
Category: Administration Components Front End SQL Installation Postgresql
joomla-cms-bot - change - 4 Nov 2016
Category: Administration Components Front End SQL Installation Postgresql -> Administration Components
Hackwar - comment - 10 Nov 2016

So, I've been looking into the tokenizer of com_finder and did some testing. The tokenizer of com_finder splits strings in steps of 2 KB at a time and then processes each piece. Quite frankly, I think that is bullshit. We are not working with Arduinos that virtualise a CPU and what not, where we would have to expect to exhaust our resources with 2 KB of data. So I tried a tokenizer that does all the splitting in a better way and at the same time does not split the work up into god knows how many steps. I then took a stemmer into the equation, stemmed all words and measured the time all of this took. To test this, I took a 6 MB file with several books in it (Sherlock Holmes, War & Peace, The History of the USA; 1.1 million words). Tokenizing that into single words takes 1.4 seconds on my machine, sending that array of words through array_unique() (down to 32000 words) pushes that to 4 seconds, and sending all of this through a Porter stemmer brings us to a total of 12 seconds. Calling array_unique() again on the stemmed words didn't increase the execution time (it actually consistently pushed it down from 12s to 11.5s) and brought the word count down to 20000 words.
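
For anyone who wants to repeat this kind of measurement, the pipeline (tokenize everything at once, deduplicate, stem, deduplicate again) can be sketched in a few lines of Python. This is not the PHP code that produced the numbers above: `toy_stem` is a crude placeholder rather than a Porter stemmer, and `dict.fromkeys` merely plays the role of `array_unique()`.

```python
import re
import time
from typing import Callable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Split the whole text into lowercase words in one pass, no 2 KB chunks."""
    return re.findall(r"\w+", text.lower())


def unique(words: List[str]) -> List[str]:
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(words))


def toy_stem(word: str) -> str:
    """Placeholder stemmer stripping a few English suffixes. Not Porter."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pipeline(text: str, stem: Callable[[str], str]) -> List[str]:
    """Tokenize, deduplicate, stem, then deduplicate the stems."""
    words = unique(tokenize(text))
    return unique([stem(word) for word in words])


def timed(fn: Callable, *args) -> Tuple[object, float]:
    """Run fn(*args) and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    sample = "Walking walks walked. The walk walks! " * 10000
    terms, seconds = timed(pipeline, sample, toy_stem)
    print(len(terms), "distinct stems in", round(seconds, 3), "seconds")
```

Deduplicating before stemming is what keeps the stemmer cheap: it only ever sees each distinct word once, however long the text is.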

So I'm going to remove such pseudo-optimizations from the codebase. Joomla should try to be grounded in reality. Yes, there might be websites out there that have gigantic texts even larger than this, but I doubt that those will be built with Joomla. Now, my text was plain text without any HTML tags. Adding those in would most likely inflate the text by 20-30%. Add another 100% onto this for good measure and we are still below the magic number of 30s for execution time. But we are then at the limit of the #__content introtext column, which takes 16MB max. I guess it is safe to say that 99.9% of all single articles on Joomla sites are going to be smaller than War and Peace, and definitely smaller than the test file that I used. So I'm pretty sure that we can take the naive approach here. There can't be hosts so crappy, running sites so big, that this is going to be a problem.
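
The estimate above can be checked with a few lines of arithmetic, sketched here in Python. The big assumption is that run time grows linearly with text size, which the measurement does not establish. The result also depends on how the two increases are combined, so both readings are shown.

```python
from typing import Sequence

MEASURED_SECONDS = 12.0  # tokenize + unique + stem for the 6 MB test file
TIME_LIMIT = 30.0        # the execution-time limit referred to above


def scaled(base: float, increases: Sequence[float], compound: bool) -> float:
    """Scale a base time by fractional increases, added up or compounded.

    Assumes run time grows linearly with text size.
    """
    if compound:
        total = base
        for increase in increases:
            total *= 1 + increase
        return total
    return base * (1 + sum(increases))


# HTML overhead of 30% plus a 100% safety margin, simply added together:
additive = scaled(MEASURED_SECONDS, [0.30, 1.00], compound=False)
# The same margins applied one after the other, for 20% and 30% HTML overhead:
compound_low = scaled(MEASURED_SECONDS, [0.20, 1.00], compound=True)
compound_high = scaled(MEASURED_SECONDS, [0.30, 1.00], compound=True)

if __name__ == "__main__":
    for label, value in [("additive, 30%", additive),
                         ("compounded, 20%", compound_low),
                         ("compounded, 30%", compound_high)]:
        print(label, round(value, 1), "s, under limit:", value < TIME_LIMIT)
```

Added together, the margins give about 27.6 s. Compounded, they give about 28.8 s at 20% HTML overhead and about 31.2 s at 30%, so the estimate sits right around the 30 s mark rather than comfortably below it.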

Hackwar - comment - 10 Nov 2016

Running all of this on War and Peace alone brings the time down to 5s. That file is still 3.2 MB.

joomla-cms-bot - change - 12 Nov 2016
Category: Administration Components -> Administration com_finder com_search Components
mbabker - comment - 12 Nov 2016

Merging these two components just doesn't feel right. Ya, it's an interesting situation shipping two search components as part of core, but the two components take such vastly different approaches to the concept of search that I don't think you can sanely manage both of those through a single component and use a plugin to "bridge" the more advanced and extremely technical approach into the "simple" component. The other changes look promising and should be continued on, it's just this one aspect that I'm not quite comfortable with at the moment.

brianteeman - edited - 12 Mar 2017
brianteeman - change - 20 May 2017
Title: Joomla 4.0: Refactoring com_search & com_finder -> [4.0] Refactoring com_search & com_finder
brianteeman - comment - 20 May 2017

@Hackwar could you resolve the merge conflicts, please?

Hackwar - comment - 21 May 2017

I will need a lot more time to work on this. I'm not sure if we should close this for now or want to keep it open as a reminder... Fixing the merge conflicts right now does not really help, since this is a work in progress. (Yes, I know, half a year without any action is a long time.)

brianteeman - change - 8 Jun 2017
Milestone Added:
humblehumanbeing - comment - 26 Jun 2017

As an end user, I dropped Finder on my multilingual site of 30K+ articles for the following reasons:
  1. The Finder indexer in the backend would start and do some work, but in most cases could not index the whole site because of PHP/MySQL restrictions. I had to run the indexer many times, from the backend or via the CLI indexer.
  2. Once indexing had finished one way or another, the Finder tables topped a few gigabytes in total. Finder tries to index every portion of a sentence and thus indexes unnecessary search terms. It certainly needs an extensive list of terms in every language that are excluded from the Finder tables.
  3. Finder was unable to index some multi-byte languages such as Chinese.
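
The exclusion list asked for in point 2 is usually called a stop-word list. A minimal Python sketch of the idea follows; the word lists and the function name are made up for illustration and are far too short for real use.

```python
from typing import Dict, List, Set

# Illustrative per-language stop words; real lists would be much longer
# and could ship with each language pack.
STOP_WORDS: Dict[str, Set[str]] = {
    "en": {"the", "a", "of", "and"},
    "de": {"der", "die", "das", "und"},
}


def indexable_terms(tokens: List[str], language: str) -> List[str]:
    """Drop tokens that should never be written to the index."""
    stop_words = STOP_WORDS.get(language, set())
    return [token for token in tokens if token.lower() not in stop_words]


if __name__ == "__main__":
    print(indexable_terms(["The", "history", "of", "Joomla"], "en"))
```

Filtering before the index is written keeps very common words out of the tables entirely, which is where the size savings would come from.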

onderzoekspraktijk - comment - 3 Dec 2017

In a Joomla 3.8.2 site with fewer than 300 articles, the smart search indexer runs for a few minutes and then stops with: The table '#__finder_tokens' is full. Re-indexing does not add articles that were not indexed before. It stops at the exact location of the previous run.
This error message appears sooner after every update.
Joomla once had professional search; I am sorry to see it deteriorate like this.


This comment was created with the J!Tracker Application at issues.joomla.org/tracker/joomla-cms/12664.

ggppdk - comment - 4 Dec 2017

"In a joomla 3.8.2 site with less than 300 articles the smart search indexer runs for a few minutes"

The speed of the indexer (smart search) is mostly affected by

  • content plugin triggering
  • db updating

"It stops at the exact location of the previous run."

Triggering content plugins is considered "necessary" to create the real text for indexing
(that is the general case; many websites simply do not really need it),

but it can

  • severely affect speed, depending on which (core or 3rd party) content/system plugins have code for the 'Content Preparing' event
  • and it can cause errors, so if you check the server response in the network tab of your browser console you will probably find the plugin causing this issue; you might also want to set error reporting to maximum and enable debug mode

I think triggering content plugins should be configurable, since many websites do not need to index text that depends on plugins.

Or in J4.0 we could require that plugins have a different method for the search indexer,
e.g. onContentPrepareSearchText?
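
Both suggestions can be sketched in a few lines of Python. Everything here is hypothetical: the function names, the `trigger_plugins` switch and the `{gallery}` tag are invented for illustration, and the hooks merely stand in for plugin code that would react to a dedicated indexer event such as the proposed onContentPrepareSearchText.

```python
from typing import Callable, List

# A hook takes the article text and returns the text that should be indexed.
PrepareHook = Callable[[str], str]


def text_for_index(raw_text: str,
                   hooks: List[PrepareHook],
                   trigger_plugins: bool = True) -> str:
    """Build the text handed to the indexer.

    With trigger_plugins=False no plugin code runs at all, which avoids both
    the slowdown and the errors a misbehaving plugin can cause.
    """
    if not trigger_plugins:
        return raw_text
    text = raw_text
    for hook in hooks:
        text = hook(text)
    return text


def expand_gallery_tag(text: str) -> str:
    """Example hook: replace a made-up plugin tag with searchable text."""
    return text.replace("{gallery}", "Photo gallery of the event")


if __name__ == "__main__":
    article = "Summer party. {gallery}"
    print(text_for_index(article, [expand_gallery_tag]))
    print(text_for_index(article, [expand_gallery_tag], trigger_plugins=False))
```

A dedicated indexer event would let plugins register only the cheap, text-producing part of their work, while the switch gives sites that do not need plugin output a way to skip it entirely.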

@onderzoekspraktijk
As for debugging your issue, it is better to post in the forums, not in this PR.

onderzoekspraktijk - comment - 4 Dec 2017

@ggppdk:
Thanks for responding to my cri de coeur. Smart search, in my opinion, is beyond debugging; it is very unstable where once it was very robust. I have been told it is very complicated software that nobody really wants to rewrite. I am afraid to use it now, and I find this a great loss to Joomla.

Hackwar - comment - 5 Dec 2017

I'm still trying to find the time to work on this for Joomla 4.0. Wish me luck.

onderzoekspraktijk - comment - 5 Dec 2017

@Hackwar:

I do wish you luck, and time, and if it would help I also wish you a great group of sponsors!

Hackwar - comment - 17 Apr 2018

This is being made obsolete by, among other things, #20185.

Hackwar - close - 17 Apr 2018
Hackwar - change - 17 Apr 2018
Status: Pending -> Closed
Closed_Date: 0000-00-00 00:00:00 -> 2018-04-17 11:44:07
Closed_By: Hackwar