
avatar ggppdk
ggppdk
11 Oct 2018

Pull Request for Issue #22098

Summary of Changes

This PR re-adds chunking (for the VALUES clause) to the smart search indexer

Testing Instructions

Try to save a "big" article containing 100K of text (or more)
while running MySQL with its default configuration of
max_allowed_packet=1M
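For reference, this is the setting involved; a minimal configuration fragment (the file location varies by distribution, and the value shown is only MySQL's historical default used for this test):

```ini
# e.g. /etc/mysql/my.cnf (location varies by distribution)
[mysqld]
# MySQL's historical default. A large article exceeds this limit when
# the indexer sends one huge multi-row INSERT without chunking.
max_allowed_packet = 1M
```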

Expected result

Article saving succeeds

Actual result

Article saving fails

Documentation Changes Required

None

Votes

# of Users Experiencing Issue
1/1
Average Importance Score
4.00

avatar ggppdk ggppdk - open - 11 Oct 2018
avatar ggppdk ggppdk - change - 11 Oct 2018
Status New Pending
avatar joomla-cms-bot joomla-cms-bot - change - 11 Oct 2018
Category Administration com_finder
avatar N8Solutions
N8Solutions - comment - 11 Oct 2018

Hello @ggppdk Thank you for looking into this!

Yes, I can confirm that this works. I have tested it on 2 websites.

To clarify something though, as I wrote when I opened #22098

Actual result

Saving the article will produce a blank page. The article is saved (only visible by viewing the article on the front-end) but the page does not refresh to the Article Edit Screen.
With the "Error Reporting" turned to Maximum, instead of a blank page, a memory error is displayed like the one below:

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 1052672 bytes) in /var/www/vhosts/domain.com/libraries/joomla/database/driver.php on line 2049

avatar frankmayer
frankmayer - comment - 11 Oct 2018

@ggppdk thanks for stepping in, I just saw this. Also, get well soon :) !

The other way to keep this running and reintroduce the chunking, while still checking whether the object is a FinderIndexerToken instance and keeping the optimizations, would be this:

    protected function addTokensToDb($tokens, $context = '')
    {
        // Get the database object.
        $db = $this->db;


        // Count the number of token values.
        $values = 0;

        if (($tokens instanceof FinderIndexerToken) === false)
        {
            // Break into chunks of no more than 1000 items
            $chunks = array_chunk($tokens, 1000);

            foreach ($chunks as $tokens)
            {
                // Cloning a new query template is twice as fast as calling the clear function 
                $query = clone $this->addTokensToDbQueryTemplate;

                // Iterate through the tokens to create SQL value sets.
                foreach ($tokens as $token)
                {
                    $query->values(
                        $db->quote($token->term) . ', '
                        . $db->quote($token->stem) . ', '
                        . (int) $token->common . ', '
                        . (int) $token->phrase . ', '
                        . $db->escape((float) $token->weight) . ', '
                        . (int) $context . ', '
                        . $db->quote($token->language)
                    );
                    ++$values;
                }

                $db->setQuery($query)->execute();
            }
        }
        else
        {
            $query = clone $this->addTokensToDbQueryTemplate;

            $query->values(
                $db->quote($tokens->term) . ', '
                . $db->quote($tokens->stem) . ', '
                . (int) $tokens->common . ', '
                . (int) $tokens->phrase . ', '
                . $db->escape((float) $tokens->weight) . ', '
                . (int) $context . ', '
                . $db->quote($tokens->language)
            );
            ++$values;

            $db->setQuery($query)->execute();
        }

        return $values;
    }

I checked this on my setup by repeating the Lorem Ipsum text (https://issues.joomla.org/uploads/1/4031df5efeb47a56ecb62392e76d663c.txt) from the issue description 6x.
This resulted in an article text of around 213KiB and worked fine. Repeating the Lorem Ipsum text 7x resulted in the "Save failed with the following error: The table '#__finder_tokens' is full" message (tested up to 14x the Lorem Ipsum text).

If I put in the Lorem Ipsum text 15x, I get a blank screen.
However, I think that this was the case before the work on the indexer. 15x the Lorem Ipsum text would be around 500KiB, and I don't think anyone has such a large article. Besides, the "Save failed with the following error: The table '#__finder_tokens' is full" error message kicks in much earlier than that.

For completeness, turning error reporting to development on the "blank page" after save gives me this:
Note: the execution fails at the serialize method of the registry, not in the indexer. I haven't looked into that, though.

( ! ) Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 827392 bytes) in Unknown on line 0
--


1 | 0.0001 | 1712248 | {main}(  ) | .../index.php:0
2 | 0.0187 | 5057824 | Joomla\CMS\Application\CMSApplication->execute(  ) | .../index.php:51
3 | 0.0187 | 5057824 | Joomla\CMS\Application\AdministratorApplication->doExecute(  ) | .../CMSApplication.php:195
4 | 0.0353 | 6158192 | Joomla\CMS\Application\AdministratorApplication->dispatch(  ) | .../AdministratorApplication.php:159
5 | 0.0364 | 6285592 | Joomla\CMS\Component\ComponentHelper::renderComponent(  ) | .../AdministratorApplication.php:101
6 | 0.0368 | 6317696 | Joomla\CMS\Component\ComponentHelper::executeComponent(  ) | .../ComponentHelper.php:357
7 | 0.0368 | 6334672 | require_once( 'PATH_REMOVED/administrator/components/com_content/content.php' ) | .../ComponentHelper.php:382
8 | 0.0400 | 6368312 | Joomla\CMS\MVC\Controller\BaseController->execute(  ) | .../content.php:21
9 | 0.0400 | 6368312 | Joomla\CMS\MVC\Controller\FormController->save(  ) | .../BaseController.php:710
10 | 0.1038 | 7038272 | ContentModelArticle->save(  ) | .../FormController.php:747
11 | 0.1056 | 7117688 | Joomla\CMS\MVC\Model\AdminModel->save(  ) | .../article.php:666
12 | 0.2384 | 7978104 | JEventDispatcher->trigger(  ) | .../AdminModel.php:1241
13 | 0.2384 | 7978552 | JEvent->update(  ) | .../dispatcher.php:160
14 | 0.2384 | 7978552 | PlgContentFinder->onContentAfterSave(  ) | .../event.php:70
15 | 0.2384 | 7978928 | JEventDispatcher->trigger(  ) | .../finder.php:38
16 | 0.2384 | 7979376 | JEvent->update(  ) | .../dispatcher.php:160
17 | 0.2384 | 7979376 | PlgFinderContent->onFinderAfterSave(  ) | .../event.php:70
18 | 0.2384 | 7979376 | FinderIndexerAdapter->reindex(  ) | .../content.php:151
19 | 0.2800 | 8651936 | PlgFinderContent->index(  ) | .../adapter.php:332
20 | 0.2881 | 8953288 | FinderIndexerDriverMysql->index(  ) | .../content.php:313
21 | 0.4648 | 11630928 | FinderIndexer->tokenizeToDb(  ) | .../mysql.php:247
22 | 0.4648 | 11630928 | FinderIndexer->tokenizeToDbShort(  ) | .../indexer.php:453
23 | 3.1473 | 120456112 | FinderIndexer->addTokensToDb(  ) | .../indexer.php:486
24 | 3.1474 | 120460104 | array_chunk (  ) | .../indexer.php:522
25 | 3.1613 | 130412240 | JSessionHandlerNative->save(  ) | .../native.php:0
26 | 3.1614 | 130412240 | Joomla\CMS\Session\Session->getData(  ) | .../native.php:189
27 | 3.1614 | 130412744 | Joomla\Registry\Registry->__clone(  ) | .../Session.php:386
28 | 3.1614 | 130412744 | serialize (  ) | .../Registry.php:89
avatar frankmayer
frankmayer - comment - 12 Oct 2018

@ggppdk Note, I changed the above-mentioned code a bit and used the language construct instanceof instead of the function is_a(), resulting in a bit more performance (varying depending on whether an array or an object is passed to the method).

avatar ggppdk
ggppdk - comment - 12 Oct 2018

@frankmayer

ok, thanks, I will update the PR late in the evening.

I just think that the benefits of this are not measurable:
the single-token case is rare (last time I checked),
as it is very performance-unwise to make such calls, so they are almost never used.

And even if they became common, PHP would be the last of our worries compared to the multiple SQL update queries.

avatar frankmayer
frankmayer - comment - 12 Oct 2018

@ggppdk yes, in the rare case of a single token it will not matter much if you are indexing a single article, but it will make a little difference in any case if you are, for example, reindexing a site with lots of articles.

Sure, the SQL queries are the bigger problem here, but it is not only this process that runs on a machine.

One thing we mostly do not think of is that sites usually handle multiple parallel requests, so optimizing things and cutting a few CPU cycles here and there can help the overall performance of a site.
Multiply that by a lot of sites and you can lower the carbon footprint a bit. :)
I mean this in general, not specifically for this issue...

BTW, I made one final change in the code above, and replaced the $query->clear() call with cloning a new object from the query template. This cuts the time spent on the reset of the query (values only) in half. Keep it or trash it, whatever you like ;) It will only show a small difference in big articles, like the ones we are testing on in this issue.
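The clone-vs-clear idea can be sketched in isolation with a stand-in class (QueryTemplate below is hypothetical, not the Joomla API; the real code clones a prepared query object):

```php
<?php
// Stand-in class (hypothetical, not the Joomla database API) to
// illustrate why cloning a pristine query template beats clear().
class QueryTemplate
{
    public $rows = array();

    public function values($row)
    {
        $this->rows[] = $row;

        return $this;
    }
}

// Built once: the INSERT header, column list, etc. live here.
$template = new QueryTemplate;

// Per chunk: a shallow clone copies the empty state in one step,
// instead of clear() walking and resetting every clause.
$query = clone $template;
$query->values('a, b, c')->values('d, e, f');

// The template itself stays pristine for the next chunk.
```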

avatar ggppdk ggppdk - change - 12 Oct 2018
Labels Added: ?
avatar ggppdk
ggppdk - comment - 12 Oct 2018

@frankmayer

i have added your code,
along with a conditional call to array_chunk

The call to array_chunk is O(n), where n is the size of the array,
and 99% of the time we do not need to call it, since the size of the array is < 1000

We could probably remove array_chunk completely,
since it seems to me that the indexes of the tokens array are always integers 0 to n-1 ...
anyway, with the new code it will rarely be called
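The conditional call described above can be sketched like this (chunkTokens is a hypothetical helper for illustration, not the actual PR code):

```php
<?php
// Hypothetical helper sketching the conditional chunking: skip the
// O(n) array_chunk() pass when the tokens already fit in one batch.
function chunkTokens(array $tokens, $limit = 1000)
{
    return count($tokens) <= $limit
        ? array($tokens)                 // common case: wrap as a single chunk
        : array_chunk($tokens, $limit);  // rare case: pay the O(n) cost
}
```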

avatar frankmayer
frankmayer - comment - 13 Oct 2018

@ggppdk Something in your changes caused the memory exhaustion to appear with less text.
I ran the same tests against your latest changes that I had run with my code, and got the exhaustion at 9x the Lorem Ipsum text; with the previous code it was at 15x. I think that although you're passing the tokens array by reference, it somehow uses twice the memory.
Of course, the "The table '#__finder_tokens' is full" error again appears when using the Lorem Ipsum text 7x and 8x, but we have still probably doubled the memory consumption here, which is not that good ;)

avatar ggppdk
ggppdk - comment - 13 Oct 2018

hmm, i'll test this

avatar ggppdk
ggppdk - comment - 13 Oct 2018

I think that although you're passing the tokens array by reference, it somehow uses twice the memory.

ok,
I see why the memory consumption is bigger:

the loop code after taking the reference to the tokens array
uses the same variable name as the variable that was referenced,
so a copy of the tokens array is forced.

I will test properly later and commit, avoiding both the array_chunk call and the extra memory consumption
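The copy this causes can be demonstrated in isolation (a hedged sketch of the mechanism, not the indexer code): when a by-reference parameter is reused as the foreach loop variable, every chunk is written back through the reference, forcing a copy and clobbering the caller's array.

```php
<?php
// Sketch of the bug described above: reusing the name of a
// by-reference parameter as the foreach loop variable.
function buggyLoop(array &$tokens)
{
    // array_chunk() is evaluated once, before the loop body runs.
    foreach (array_chunk($tokens, 2) as $tokens) {
        // Each assignment to $tokens goes through the reference,
        // detaching copy-on-write and overwriting the caller's array.
    }
}

$data = array(1, 2, 3, 4, 5);
buggyLoop($data);
// $data now holds only the last chunk: array(5).
```

Renaming the loop variable (e.g. `foreach (... as $chunk)`) avoids both the write-back and the extra copy.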

avatar frankmayer
frankmayer - comment - 13 Oct 2018

@ggppdk I have some thoughts on maybe improving the indexer a bit more (even getting rid of the memory exhaustion in general). Would you consider either getting this PR through with the state of the function as I suggested above, in order to provide a fix for the issue (so that I can work on improvements in a separate PR), or alternatively letting me replace this PR with a new one?
After all, my PR broke it for people with big articles (sorry guys)

avatar ggppdk
ggppdk - comment - 13 Oct 2018

@frankmayer

sure,
please go ahead and use the comments / work that you have already contributed in this discussion to create a new PR

thanks

avatar ggppdk ggppdk - change - 13 Oct 2018
Status Pending Closed
Closed_Date 0000-00-00 00:00:00 2018-10-13 12:19:02
Closed_By ggppdk
avatar ggppdk ggppdk - close - 13 Oct 2018
