This is a first take at improving Finder and solving our com_search/com_finder issue (namely that we have 2 components to do searching with...)
Finder currently has 16 tables to store the mapping between term and content. This was done for performance reasons, since these tables might become really big really fast and for large sites it would be better to have several smaller tables, or at least that was the theory. That theory might have been correct in 2008/2009, when Finder was written (I don't know enough about this to judge that), but it is not correct in 2016 anymore.
Finder basically implemented database table partitioning in PHP, which I would call a bad idea. MySQL supports partitioning since 5.6. I would argue that we should drop this PHP implementation for several reasons:
1. The performance gain from this partitioning only really has an effect on large sites. Those sites, however, should implement real partitioning in the database rather than use this one.
2. For smaller sites it actually has a detrimental effect.
3. It is a lot of code and additional tables that confuse people, including me.
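To make "partitioning in the application" concrete, here is a minimal sketch in Python. It is not the actual Finder code. The hash function, the table base name `finder_links_terms` and both helper names are assumptions chosen for illustration. The point is only that the application, not the database, decides which of the 16 tables a term lands in.

```python
import hashlib

# Hypothetical base name; the 16 suffixes are the hex digits 0-f.
HEX_DIGITS = "0123456789abcdef"


def shard_table(term: str, base: str = "finder_links_terms") -> str:
    """Pick one of 16 tables from the first hex digit of an MD5 hash of the term."""
    digest = hashlib.md5(term.encode("utf-8")).hexdigest()
    return f"{base}{digest[0]}"


def all_shard_tables(base: str = "finder_links_terms") -> list[str]:
    """Every table the application has to create, maintain and know about."""
    return [f"{base}{digit}" for digit in HEX_DIGITS]


table = shard_table("joomla")
```

With a single mapping table, `shard_table` and the table list disappear and every insert or lookup targets the same table, which is the simplification this PR is after.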
To me, it at least feels as if indexing the sample data is quicker with these changes than before. Searching should also be a bit quicker.
This will require some updates to the documentation. There are several places where we talk about the 16 mapping tables.
Status | New | ⇒ | Pending |
Category | ⇒ | Administration Components Front End SQL Installation Postgresql |
To be honest, I didn't add update logic yet, because I'm still working on rewriting finder and search in additional PRs... Should I first add update logic to this or would this be mergeable without that? Do you agree with these changes or am I on the wrong track?
I don't think it can be merged until we know if it is even possible to handle updates.
Looks like GitHub has once again not parsed an email reply from me to this from the other day, so I am retyping it now and it will probably re-appear later ;)
MySQL supports partitioning since 5.6.
This would mean that we would have to increase the minimum requirement for Joomla. I know we can do this with J4, but just because we can doesn't mean we should. It will depend on the usage stats for MySQL 5.6 (or did I misunderstand?).
@brianteeman I understand that MySQL supports this from 5.6 and that we require only a lower version. However, partitioning is not a requirement for this. What I'm trying to say is this:
Would that be acceptable?
I think that will have to go to the PLT for a decision as it is an increase in the system requirements
@jeckodevelopment can you raise this with the PLT please
IMHO we shouldn't even try to handle updates (other than getting rid of the old tables and adding the new one) with this kind of change -- the task would just time out on the sites which actually use the smart search feature. Just make sure that users are aware that they need to run the CLI command or go to the admin and run the task from there to rebuild the search index.
Having good release notes and documentation is the key here.
Other than just making my comment, I really don't have preference on this matter. I don't use smart search anywhere.
Is it possible to add optional (additional) database settings for this purpose, and keep the search indexes there? Search indexes usually take about 60-70% of database size.
Keeping the index is not something that we should do. We could do this, but that is actually a lot of work for both us and the site admins and it will be equally fast for the site admins to simply rebuild the index.
I also don't see how to add settings here. Really, we are reducing the size of the database with this and we don't require a change in system requirements.
Again: This pseudo-partitioning was originally done to improve performance for ADDING content to the site. Searching is actually a bit slower due to this, albeit negligibly. By removing these tables, we are only making saving articles a bit slower. To keep the performance for that, we would have to add partitioning and indeed raise system requirements. But again: This would only apply to very large sites.
It's not just large sites that need this. The problem that sharding is trying to solve is the performance of the database when faced with large numbers of inserts. This can be a problem on even quite modest sites when saving large articles (articles with a lot of words in them). We're faced with a race to add several thousand entries into the links_terms table while continuing to give the user a reasonably responsive UI (and keep within the typical 30 second time limit). And database insert performance degrades as more entries are added to a table, even though read performance is hardly affected.
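To make the write pattern concrete, here is a small Python sketch using an in-memory SQLite database. It measures nothing and is not how Smart Search writes its data. The column names, the chunk size and the `insert_mappings` helper are assumptions for illustration. It only shows the shape of the work: one saved article turning into several thousand rows in `links_terms`.

```python
import sqlite3


def insert_mappings(conn, link_id, term_weights, chunk_size=500):
    """Insert one (link, term, weight) row per term, in chunks, in one transaction."""
    rows = [(link_id, term_id, weight) for term_id, weight in term_weights.items()]
    with conn:  # commits on success, rolls back on error
        for start in range(0, len(rows), chunk_size):
            conn.executemany(
                "INSERT INTO links_terms (link_id, term_id, weight) VALUES (?, ?, ?)",
                rows[start:start + chunk_size],
            )
    return len(rows)


conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema for the mapping table.
conn.execute("CREATE TABLE links_terms (link_id INTEGER, term_id INTEGER, weight REAL)")

# One large article with 5000 distinct terms (made-up data).
inserted = insert_mappings(conn, 1, {term_id: 1.0 for term_id in range(5000)})
row_count = conn.execute("SELECT COUNT(*) FROM links_terms").fetchone()[0]
```

Timing a loop like this against tables of different sizes on real hosting would be one way to gather the kind of data asked for here.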
Whilst I'm generally supportive of the idea, I think we need to tread very carefully. The biggest question is support on typical hosts for medium-sized sites (small sites don't need it and large sites are probably better off using an external search engine). I don't think we're gathering stats that would really answer that question.
Incidentally, I have an unfinished PR that would add a JTableSharded class that would handle sharding generically for any table and allow the sharding complexity to be pushed into our database/table layer and out of Smart Search itself. The sharding algorithm and the degree of sharding would be configurable so that for small sites with little data you could have just a single table. That approach would simplify the search code without having to increase minimum requirements and is not incompatible with having the database handle the sharding natively.
But to properly evaluate any of these ideas it is necessary to get some solid performance data on a range of sites differing in size of articles as well as number. You can't just go on a feeling that a site with default sample data feels a bit quicker.
From what @chrisdavenport said, I like this idea on moving the logic to JTableSharded.
Sorry, but sharding/partitioning is something that belongs into the database and not the application logic.
Yes, we are trying to insert several thousand rows into that table, however that would be the next step to work on. I've had very good experiences with a slightly different approach on how to store the terms and that reduces the number of entries by several magnitudes. Basically, by filtering on common words, not storing tuples and using a stemmer, you get about a fifth of the terms in the terms table and equally many entries in the terms_links table. At the same time, I had better search results than with Finder right now. Maybe I did something wrong, but when installing the sample data, enabling Finder and building the index, afterwards I could search for "Joomla Beginners" and it wouldn't return a result, although there was an article "Beginners" with the word "Joomla" on the homepage.
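As a rough illustration of that approach (common-word filtering, single terms instead of tuples, and stemming), here is a minimal Python sketch. It is not the code from that project or from Finder. The stop-word list, the deliberately naive suffix stripper and all function names are assumptions, and a real implementation would use a proper stemmer.

```python
import re

# Tiny, made-up stop-word list; a real one would be per language and much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "with", "is"}


def naive_stem(word):
    """Very crude stemmer: strip one common English suffix if enough of the word remains."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Return the set of stemmed single terms, skipping common words (no tuples)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {naive_stem(word) for word in words if word not in STOP_WORDS}


def matches(query, document):
    """A document matches when it contains every stemmed query term."""
    return index_terms(query) <= index_terms(document)


found = matches("Joomla Beginners", "Beginners: an article about Joomla on the homepage")
```

Because query and document go through the same reduction, "Beginners" and "beginner" end up as the same stored term, and the "Joomla Beginners" query matches the example document.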
I agree that ideally we should push sharding/partitioning into the database where it logically belongs. But we must also be pragmatic about the hosting environments we have to work with. If the minimum requirements can be raised to MySQL 5.6 (and don't forget some equivalent for Postgres) for Joomla 4.x then I would have no objection to this PR. That's the most basic question that needs to be resolved first.
I have a long list of improvements that I'd like to make to Smart Search and you touch on some of them. Common word support is minimal (and completely non-functional at the moment; see #12450), tuple-length should be configurable (perhaps with a default of 1 on new installs, but allowing longer tuples where practical) and stemmer support needs work (for example, only the first element in a tuple is stemmed at the moment and the default Snowball stemmer is no longer in any default PHP builds). If you have time to work on these issues, that's great. (Sadly, my time has been gravely limited of late, although I have one client who has recently "sponsored" some time on making Smart Search improvements: I have a fix for the tokeniser "memory leak" coming soon as a result of that).
I don't see how moving the sharding/partitioning into JTable would help us. JTable introduces a lot of overhead and while it is an acceptable solution for managing records like in #__content, I can't see a useful application for something like the mapping table of Finder.
My todo list for Search/Smart Search in the order that I will work on them:
I don't have time, but I promised @wilsonge that I would work on this for 4.0, so waddayado.
I wrote a similar search feature for a customer several years ago, although it was simply made to work and not as polished as Finder is right now. However, the search results were better and the indexer could index the whole site in ~10 seconds without paging via AJAX. There also were quite a few articles on that site (~1000). I don't see why we shouldn't be able to get to something similar with Finder/Search.
Well, JTableSharded may never see the light of day. I haven't looked at it in years and there's little prospect of me finding the time to look at it again in the very near future. And I do agree that sharding is best done natively by the database itself. Incidentally, thank you for pointing out that MySQL 5.6+ now supports sharding/partitioning; I didn't know that before you mentioned it.
Move backend com_finder into com_search and change frontend com_finder to a search plugin for com_search.
Not sure what you have in mind with this. There are almost certainly many more sites using com_search than com_finder and replacing one with the other would likely be a big upheaval for many folks. The aim with Joomla 4 is to break as little as possible.
Introduce result weighting
Smart Search already does that, or did you mean something else?
As for your other suggestions, I look forward to seeing PRs. However, I would urge you to try to make your changes in Joomla 3.x rather than 4.x so that more people can benefit in the near term. Most of these changes can be made without breaking BC.
I'm not talking about replacing com_search with com_finder. I'm talking about integrating the backend functionality of com_finder into com_search and the frontend functionality into a search plugin. There should only be one search component in our system and the finder functionality should be a regular search plugin.
The weighting refers to introducing that to all search plugins. A weight can have values between 0 and 1. If the plugin does not provide a weight, 0.5 will be assumed.
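A minimal Python sketch of how that could work when merging results from several plugins. The `Result` class, the function names and the clamping of out-of-range values are assumptions for illustration. Only the 0 to 1 range and the 0.5 default come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_WEIGHT = 0.5  # assumed when a plugin does not provide a weight


@dataclass
class Result:
    title: str
    source: str                     # name of the plugin that produced the result
    weight: Optional[float] = None  # None means "plugin gave no weight"


def effective_weight(result):
    """Use the default for missing weights; clamp others to [0, 1] (clamping is an assumption)."""
    if result.weight is None:
        return DEFAULT_WEIGHT
    return min(1.0, max(0.0, result.weight))


def merge_results(*plugin_results):
    """Combine the result lists of all plugins and order them by weight, highest first."""
    merged = [result for results in plugin_results for result in results]
    return sorted(merged, key=effective_weight, reverse=True)


content = [Result("A", "content", 0.9), Result("B", "content")]
external = [Result("C", "external", 0.2), Result("D", "external", 1.7)]
ordered_titles = [result.title for result in merge_results(content, external)]
```

Old search plugins that know nothing about weights would simply land in the middle of the list with 0.5.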
Sorry, but I don't understand that at all. The front-end functionality of com_finder provides search across all content types, so if you put that functionality into a com_search plugin you might as well just switch the other plugins off, leaving only the com_finder one. In which case, why have it as a plugin at all? And if it's not a plugin you just end up back where you started, with com_finder as implemented. I know this has nothing to do with this PR, but I just don't see what you're trying to achieve. What problem are you trying to solve?
com_finder provides search across all content types that it supports. Unfortunately there are quite a lot of extensions out there that don't provide a finder plugin, but just one for com_search. There are also a lot of situations where you cannot index the data the way Finder needs it, for example when querying an external source. In those cases, we need something like com_search with its plugin structure. That is also the reason why we have com_search and com_finder in the first place. Otherwise com_search would have been replaced with com_finder in 1.6 directly.
We need a solution that allows adding external sources and that allows all the old search plugins to be used. The logical solution would be to include Finder as a search plugin in com_search.
What makes you think that com_finder can't index external sources? It actually excels at it because search queries are made against the index rather than directly with the (potentially unreliable) external source. A slow external source would cripple com_search but would only slow com_finder's indexer down, leaving front-end searches unaffected.
The reason we still have two search components is that nobody has found the time to refine com_finder to the point where it can fully replace com_search. The sticking points are around the size of database tables, the performance of the indexer and some issues with multi-language use. All of these problems are solvable. The intention always was, and to my knowledge still is, to deprecate com_search in favour of com_finder at some point. That would have happened in J2.5 if com_finder had been ready back then.
Of course com_finder could index an external source. However, I had to write a search plugin where it was not possible to index that external source, because that source did not provide the data. It only had an API to start a search and get the results. In those cases, com_finder would have been screwed.
If my approach is not what the leadership team wants, then tell me as an official statement and I will stop right here and now. Otherwise I will keep on coding and hopefully be able to provide a solution for 4.0.
What makes you think that com_finder can't index external sources?
If the data stay the same forever, then yes, they can be "indexed".
In other cases, how do you use / combine (in real time) an external source?
Status | Pending | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2016-10-31 16:08:39 |
Closed_By | ⇒ | Hackwar |
Upgrade logic is missing. Maybe the easiest way would be to drop all the existing tables and create the missing one, then add a post-install message asking users to recreate the index.