User tests: Successful: Unsuccessful:
Joomla has two search components since Joomla 1.6, com_search and com_finder. Both have several issues and both lack features and useability. This PR is an attempt to largely rewrite both and their accompanying plugins and code. This is an ongoing development effort and submitted as a work in progress so that others can comment and help steer the direction.
I will update this list and mark done parts accordingly
Finder currently has 16 tables to store the mapping between term and content. This was done for performance reasons, since these tables might become really big really fast and for large sites it would be better to have several smaller tables, or at least that was the theory. That theory might have been correct in 2008/2009, when Finder was written, (I don't know enough about this to judget that) but it is not correct in 2016 anymore.
Finder basically implemented database table partitioning in PHP, which I would call a bad idea. MySQL supports partitioning since 5.6. I would argue that we should drop this PHP implementation for several reasons:
To me it at least feels as if with these changes, indexing the sample data is quicker than before. Searching should also be a bit quicker. This will require some updates to the documentation. We have several places, where we talk about the 16 mapping tables.
The main goal is to merge the two components. com_search is needed, because there are several extensions out there that only provide a search plugin for com_search. com_finder on the other hand does provide the indexed search. The solution will be to move the necessary code from com_finder into com_search and change the search-code of com_finder (basically the search model of com_finder) into a search plugin. I would like to research the possibilities in using new plugin events and providing interfaces for a search plugin to implement, to both merge the "search" and "finder" plugins into one plugin and provide something that developers can work with. Old search plugins could be still used by providing a wrapper-legacy-plugin like we have for the legacy component routers.
Right now, com_search does not have any weighting done for the search results. Results simply come out the way the search plugins push them in, which could simply be the way they have been added to the database or completely random. For most cases however, we want search results that are ordered by relevance. For that, each result should have a weight that describes how relevant this result is and by which we can sort the whole result set. To prevent developers from messing with this by simply setting the weight to some ridicoulus number, so that the results from their plugin is always at the top, we limit the weights to a value between 0 and 1, using a default of 0.5 when no weight is given.
Instead of splitting words by spaces (and thus preventing languages like chinese from being indexed), we will have to provide a better, language specific tokenizer. Instead of hardcoding a few stemmers into the core codebase, we will "push" this to the translation teams. They will have to provide a stemmer in their language pack to stem the words properly. I will try to provide a few stemmers where I can get a hold of them. Instead of indexing tuples of 1-3 words, we are going to index just single words per default and longer tuples only on request. To be honest, I would remove multiple-word-indexes altogether. Searching should also be improved to not only return results for consecutive words (=3-word-tuples), but for words all over the content. Searching for "Beginner Joomla" in the sample data should return the "Beginners" article...
I personally don't understand that feature yet and I think it is something that a lot of people have issues with. Maybe we can find some way to make this useable.
The code is from somewhere around 2009 and while it is of high quality, some of it has been superseeded by changes in our framework. Also, with all the above changes, some cleanup will be necessary.
This PR is based on #12592. Since writing a dozen PRs up and down, I opted for creating just one big one instead.
I invite you to help me to make this happen. Feel free to comment or write code yourself and send a PR against my branch to get this done.
Status | New | ⇒ | Pending |
Labels |
Added:
?
|
Category | ⇒ | Administration Components Front End SQL Installation Postgresql |
Category | Administration Components Front End SQL Installation Postgresql | ⇒ | Administration Components |
Doing all of this to just War and Peace alone brings this down to 5s. The file would still be 3.2 MB
Category | Administration Components | ⇒ | Administration com_finder com_search Components |
Merging these two components just doesn't feel right. Ya, it's an interesting situation shipping two search components as part of core, but the two components take such vastly different approaches to the concept of search that I don't think you can sanely manage both of those through a single component and use a plugin to "bridge" the more advanced and extremely technical approach into the "simple" component. The other changes look promising and should be continued on, it's just this one aspect that I'm not quite comfortable with at the moment.
Title |
|
I will need a lot more time to work on this. I'm not sure if we should close this for now or want to keep it open as a reminder... Fixing the merge conflicts right now does not really help, since this is a work in progress. (Yes, I know, half a year without any action is a long time.)
Milestone |
Added: |
Milestone |
Added: |
As an end user I had dropped using Finder on my multilanguage site of 30K+ articles for the following reasons:
1- Finder indexer at backend was starting, doing some job but was unable to index all site in most cases because of PHP/MySQL restrictions. I needed to run indexer many times using backend or cli indexer.
2- Finder tables were topping a few gigs in total after indexing was finished in some way. It tries indexing every portion of a sentence and thus indexes unnecessary search terms. It certainly needs a vast list of terms in every language that will not be included in Finder tables.
3- Finder was unable to index some multi-byte languages such as Chinese
In a joomla 3.8.2 site with less than 300 articles the smart search indexer runs for a few minutes and stops: The table '#__finder_tokens' is full. Re-indexing does not add new previously not indexed articles. It stops at the exact location of the previous run.
This error message keeps appearing more quickly after every update.
Joomla once had professional search, I am sorry to see it deteriorate like this.
In a joomla 3.8.2 site with less than 300 articles the smart search indexer runs for a few minutes
Speed of the indexer (smart search) is mostly effected by
It stops at the exact location of the previous run.
Triggering content plugins is considered "necessary" to create the real text for indexing
(the above is for the general case, many websites simply do not really need it),
but it can
i think it should be configurable to triggering or not content plugins, it many websites do not need to add text that depends on plugins
or in J4.0 we could require that plugins have a different method for the search indexer ?
e.g. onContentPrepareSearchText
@onderzoekspraktijk
about debugging your issue, better post in forums, not in this PR
@ggppdk:
Thanks for responding to my cri de coeur. Smart search in my opinion is beyond debugging, it is very unstable where once it was very robust. I have been told it is very complicated software that really nobody wants to rewrite. I fear to use it now, and I find this is a great loss to Joomla.
I'm still trying to find the time to work on this for Joomla 4.0. Wish me luck
Status | Pending | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2018-04-17 11:44:07 |
Closed_By | ⇒ | Hackwar |
So, I've been looking into the tokenizer of com_finder and did some testing. The tokenizer of com_finder splits strings in steps of 2kb at a time and then processes it. Quite frankly, I think that is bullshit. We are not working with Arduinos that virtualise a CPU and what not, so that we have to expect to exhaust our resources with 2kb of data. So I tried a tokenizer that does all the splitting in a better way and at the same time does not split this up in god knows how many steps. I then took a stemmer into the equation and stemmed all words and measured the time all of this took. To test this, I took a 6MB file with several books in it (Sherlock Holmes, War & Peace, The History of the USA, 1.1 million words). Tokenizing that into single words takes 1.4 seconds on my machine, sending that array of words through array_unique() (down to 32000 words) pushes that to 4 seconds and sending all of this through a Porter Stemmer brings us to a total of 12 seconds. Again calling array_unique() on this didn't change the execution time (it actually consistently pushed this down from 12s to 11.5s) but brought the word count down to 20000 words.
So I'm going to remove such pseudo-optimizations from the codebase. Joomla should try to be grounded in reality and yes, there might be websites out there that have gigantic texts that are even larger than this, but I doubt that those will be done with Joomla. Now my text was plaintext without any HTML tags. Adding those in would most likely inflate the text by 20-30%. Adding another 100% onto this for good measure and we are still below the magic number of 30s for execution time. But we are at the limit of the #__content introtext column, which takes 16MB max. I guess it is safe to say that 99.9% of all content in a single article in Joomla sites is going to be smaller than War and Peace and definitely smaller than this test-file that I used. So I'm pretty sure that we can take the naive approach here. There can't be that crappy hosts out there with such big sites that this is going to be a problem.