User tests: Successful: Unsuccessful:
While testing the performance of the smart search using a large amount of articles (10.000+) we noticed a big increase in response time. Checking the queries with the debug tool revealed two queries which were performing quite poorly, taking up around 1.5s of the total 1.7s response time (12.000 articles). This effect scales up with an increasing amount of articles. In this PR we replaced these two queries with more efficent ones, reducing the total response time of these queries to approximately 350ms (from 1.5s).
We replaced the old query, which was responsible for searching the db given one or multiple search terms and returning the weighted results:
SELECT l.link_id, l.object,SUM(m.weight) AS ordering
FROM #__finder_links AS l
INNER JOIN `#__finder_links_terms` AS m ON m.link_id = l.link_id
WHERE `l`.`access` IN (:preparedArray1,:preparedArray2,:preparedArray3) AND l.state = 1 AND l.published = 1 AND (l.publish_start_date IS NULL OR l.publish_start_date <= '2021-08-24 12:03:00') AND (l.publish_end_date IS NULL OR l.publish_end_date >= '2021-08-24 12:03:00') AND m.term_id IN (18,103)
GROUP BY l.link_id,l.object
HAVING SUM(CASE WHEN m.term_id IN (18) THEN 1 ELSE 0 END) > 0 AND SUM(CASE WHEN m.term_id IN (103) THEN 1 ELSE 0 END) > 0
ORDER BY ordering DESC LIMIT 20
With an updated, faster query:
SELECT l.link_id, l.object,m.weight AS ordering
FROM (
SELECT link_id, SUM(weight) as weight
FROM #__finder_links_terms
WHERE term_id IN (18,103)
GROUP BY link_id
HAVING COUNT(link_id) > 1) AS m
INNER JOIN `#__finder_links` AS l ON l.link_id = m.link_id
WHERE `l`.`access` IN (:preparedArray1,:preparedArray2,:preparedArray3) AND l.state = 1 AND l.published = 1 AND (l.publish_start_date IS NULL OR l.publish_start_date <= '2021-08-24 12:04:00') AND (l.publish_end_date IS NULL OR l.publish_end_date >= '2021-08-24 12:04:00')
ORDER BY ordering DESC LIMIT 20
The difference in performance comes from grouping the data before joining the two large tables. We also reduced the GROUP BY
to only group by l.link_id
, since there are no duplicate l.object
for any given l.link_id
because both are part of the #_finder_links
table, where link_id
is the primary key.
This also has the positive side effect, that the new query is more efficient to use with the _getListCount($query)
of the BaseDatabaseModel
-class.
These changes were implemented in the getListQuery()
of the SearchModel
-class. We've also refactored the method to work with the new query.
Test this on a clean version of 4.0
The loading time is pretty slow (approximately 1.5 s for 12k articles). These were the two responsible queries before the changes:
The loading time should be significantly lower and the search results should be the same. Also these two queries should appear:
Probably not.
Thanks to @Hackwar for answering our questions.
Status | New | ⇒ | Pending |
Category | ⇒ | Front End com_finder |
At a first view the change looks valid to me. I have to check more detailed when I find enough time.
For those who like to understand better: The thing in the new query in the "FROM" clause is not a subquery, it is an instant view. It would be a subquery if it would appear in the "WHERE" clause.
Instant views can be used to reduce the result set at an early point in a query and - if done well and having indexes where needed - can improve performance a lot according to my experience, in opposite to subqueries which can be bad for performance.
Thanks for the feedback and taking the time to provide a more detailed explanation!
Labels |
Added:
?
|
This new part
SELECT link_id, SUM(weight) as weight
FROM #__finder_links_terms
WHERE term_id IN (18,103)
GROUP BY link_id
HAVING COUNT(link_id) > 1)
is still pulling back almost 26,000 rows though...
This new part
SELECT link_id, SUM(weight) as weight FROM #__finder_links_terms WHERE term_id IN (18,103) GROUP BY link_id HAVING COUNT(link_id) > 1)
is still pulling back almost 26,000 rows though...
Yes, there are still 26.000 rows which have to be searched but it is still more performant because the new behaviour explained here.
This pull request has automatically rebased to 4.2-dev.
This pull request has been automatically rebased to 4.3-dev.
Title |
|
This pull request has been automatically rebased to 5.2-dev.
Title |
|
Status | Pending | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2024-07-20 21:55:58 |
Closed_By | ⇒ | Hackwar | |
Labels |
Added:
Feature
PR-5.0-dev
PR-5.2-dev
Removed: ? |
I'm really sorry for keeping you waiting for so long. I wanted to look at this several times in the last few years, but I feared that I wouldn't come with good news and thus procrastinated reviewing this PR. Unfortunately this wasn't entirely unfounded, since this PR unfortunately does not work. I tried a search in our testing sampledata with two taxonomies and the result was, that I only got one result instead of the 12 I expected. I also didn't see a way to fix that and thus unfortunately have to refuse this PR. Thank you for your work and I hope maybe you still want to contribute in the future.
Please can you post the
EXPLAIN
results for each query (The before and after ones), as run, with your data set, as a screenshot.