?
avatar andykirk
andykirk
23 Sep 2014

Files in question:

(A): \administrator\components\com_finder\helpers\indexer\parser.php
(B): \administrator\components\com_finder\helpers\indexer\parser\html.php

I have found two related issues with the code in the above files.

Firstly in file (A) the parse method checks if the $input is greater than 2048 bytes and if so chunks the $input at the last space.
The problem with this is that it doesn't take into account any HTML tags.
If the last space is inside a tag, the chunking splits the tag before it's passed to the parse method of file (B).
It's this parse method that's responsible for striping HTML tags (strip_tags()) so by the time the $input reaches this function it's too late - the tag is already split the search result description is left with partial tag garbage.

Example:

'Lorum ipsum ...' + 2006 more characters + 'dolor <abbr title="Some useful title">A thing</abbr> ...'

The 2048 character count would finish at the end of 'useful', the last available space is after 'Some'.
Because of the chunking the Finder result description looks like:

'Lorum ipsum ... dolor useful title">A thing ...'

Whereas it should be:

'Lorum ipsum ... dolor A thing ...'

The only ways I can think of solving this would be to either strip_tags() BEFORE chunking (simple but not inline with code separation) or to introduce some proper HTML chunking (complicated) with each parser having it's own chunking method.

The second issue I've found is that line 41 of file (B) ($input = str_replace('>', '> ', $input);) messes up the Finder results description too.
(I'm not sure what this line achieves but I suspect it fixes something that occurs because of the chunking)

Example:

I have a need to stylistically offset certain letters that make up an acronym.
I can't do this with CSS so I need to use tags like this:

'<b>H</b>yper<b>T</b>ext <b>M</b>arkup <b>L</b>anguage'

Because of the line I mentioned (and because it occurs before tags are stripped), the Finder results description looks like this:

'H yperT ext M arkup L anguage' - see the extra spaces.

Line 41 needs to be changed to take into account tags, removed or strip_tags() needs to occure first.

Because this isn't straightforward to fix and would need discussion or a decision by a Joomla dev, I can't really raise a PR, but please let me know if there's anything else you want me to do.

Thanks,
Andy

avatar andykirk andykirk - open - 23 Sep 2014
avatar brianteeman brianteeman - change - 23 Sep 2014
Category Components Components Search
avatar Kubik-Rubik
Kubik-Rubik - comment - 24 Sep 2014

Hello Andy,

thank you taking the time to create this ticket. I have analyzed the code and have to give your right.

How or where do you get the description with the messed up code in the output? In the smart search component the field for the description in the database is only 255 characters long, so only the first 255 characters are saved for the output. The problematic area after 2000 + x characters is not considered at all. Nevertheless, this bug should be fixed in the function process in the html.php file. I would suggest to add an additional check whether some HTML parts are still present.

I can also confirm the second issue with the example text that you've posted.

Summary: We should tackle it! :-)

This comment was created with the J!Tracker Application at http://issues.joomla.org/.

avatar Kubik-Rubik Kubik-Rubik - change - 24 Sep 2014
Status New Confirmed
avatar andykirk
andykirk - comment - 25 Sep 2014

Hi,
I'm seeing this on the Smart Search results page.
Live example: https://www.npeu.ox.ac.uk/search?q=graces+training&Search=
Description included remnants of a <video>.

This chunking is all happening before tags are stripped and before the 255 character truncation.
The process is going like this:

avatar Kubik-Rubik
Kubik-Rubik - comment - 25 Sep 2014

@andykirk Yes, it makes totally sense. As I suggested, we should improve the check whether some HTML parts are still present and remove them to avoid such an output!

It is on my to do list. :-)

avatar andykirk
andykirk - comment - 25 Sep 2014

If it helps, I created a truncate HTML function ages ago.
I've just put it up as a gist:

https://gist.github.com/andykirk/b304a3c84594515677e6

I'm sure it could do with a lot of improvement but I thought I'd share in case it was useful.

Thanks,

avatar Kubik-Rubik
Kubik-Rubik - comment - 25 Sep 2014

@andykirk Wow, too much code... :-) I just wanted to add a simple check whether an angle bracket is present and remove the part from the beginning to this bracket. This should solve the first issue completely.

avatar andykirk
andykirk - comment - 25 Sep 2014

That's totally fine. As I said - I just put the truncate function up in case it was useful.
Actually, thinking about it - it probably won't be useful, as you could still end up with HTML showing when the 255 limit kicks in.

avatar andykirk
andykirk - comment - 25 Sep 2014

FWIW - still think using strip_tags before the string is counted would be a more bullet-proof way of solving this.
I.e.

$input = strip_tags($input);
Above:
if (strlen($input) > 2048)

Cheers

avatar Kubik-Rubik
Kubik-Rubik - comment - 25 Sep 2014

Yes, this is of course the easiest solution but we have the HTML parser exactly for that reason, so the processing should be in the corresponding parser class (parser/html.php).

avatar andykirk
andykirk - comment - 25 Sep 2014

Yes, of course, I understand that, but parser.php makes assumptions about the character length and chunks it before passing it to parser/html.php parse().
I'm not 100% sure I fully understand how your suggested fix will work, but I've got a feeling it may still be problematic, especially if you do that before strip tags, but I could be wrong :-)

One idea could be to add a pre_process() method parser/html.php which is simply:

return strip_tags($input);

Then parser.php could check for the existence of this method before establishing the character count:

...
$return = null;
/*add this >>>*/ if (method_exists('pre_process', $this))
{
    $input = $this->pre_process($input);
}

// Parse the input in batches if bigger than 2KB.
if (strlen($input) > 2048)
{
...

As I say, just an idea. :-)

avatar Kubik-Rubik
Kubik-Rubik - comment - 25 Sep 2014

@andykirk :+1: for your idea! :-)

avatar andykirk
andykirk - comment - 25 Sep 2014

True.
And thanks :-)

avatar Kubik-Rubik
Kubik-Rubik - comment - 25 Sep 2014

Please check out the fix in #4345

I decided to go with a change only in the HTML parser.

avatar Kubik-Rubik
Kubik-Rubik - comment - 25 Sep 2014

I will close this issue here, please go on in the linked PR with the discussion.

Thank you for reporting this, @andykirk!

This comment was created with the J!Tracker Application at http://issues.joomla.org/.

avatar jissues-bot jissues-bot - close - 25 Sep 2014
avatar zero-24 zero-24 - close - 25 Sep 2014
avatar Kubik-Rubik Kubik-Rubik - change - 25 Sep 2014
Status Confirmed Closed
avatar jissues-bot
jissues-bot - comment - 25 Sep 2014

Set to "closed" on behalf of @Kubik-Rubik by The JTracker Application at issues.joomla.org

avatar jissues-bot jissues-bot - change - 25 Sep 2014
Closed_Date 0000-00-00 00:00:00 2014-09-25 12:50:13
avatar andykirk
andykirk - comment - 25 Sep 2014

Hi, thanks for the fix. However, I still anticipate a problem (though I'm unable to test at the moment).

Firstly, the fix makes this line redundant:
$input = str_replace('>', '> ', $input);
since it's all just being deleted anyway.

Secondly, as per my original example:
'Lorum ipsum ...' + 2006 more characters + 'dolor A thing ...'

the chunk of text that's passed to the parser would be:

'Lorum ipsum ...' + 2006 more characters + 'dolor <abbr title="Some'

So maybe you meant to < in the fix and not > to remove the tag remnant?

Cheers

avatar Kubik-Rubik
Kubik-Rubik - comment - 25 Sep 2014

Yes, this line is obsolete with the fix. You are right, the cut could also lead to an open bracket. So we should check both possibilities (in the beginning or in the end of the string).

avatar Kubik-Rubik
Kubik-Rubik - comment - 25 Sep 2014

Please test again, thanks!

avatar andykirk
andykirk - comment - 25 Sep 2014

That's great. I can't test until tomorrow, but it looks good to me :-)

Thanks for getting on with this so quickly.

avatar KISS-Web-Design
KISS-Web-Design - comment - 25 Sep 2014

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its
recipients. This is a temporary error. The following address(es) deferred:

chris.jones-gill@ntlworld.com
Domain kisswebdesign.co.uk has exceeded the max emails per hour (100/100 (100%)) allowed. Message will be reattempted later

------- This is a copy of the message, including all the headers. ------
Received: from github-smtp2-ext7.iad.github.net ([192.30.252.198]:52786 helo=github-smtp2a-ext-cp1-prd.iad.github.net)
by duffman.enixns.com with esmtps (TLSv1:DHE-RSA-AES256-SHA:256)
(Exim 4.82)
(envelope-from noreply@github.com)
id 1XXCHv-000Z75-GX
for chris.jones-gill@kisswebdesign.co.uk; Thu, 25 Sep 2014 17:53:03 +0100
Date: Thu, 25 Sep 2014 06:53:15 -0700
From: andykirk notifications@github.com
Reply-To: joomla/joomla-cms reply@reply.github.com
To: joomla/joomla-cms joomla-cms@noreply.github.com
Message-ID:
In-Reply-To:
References:
Subject: Re: [joomla-cms] Finder: Indexer Parsers break search result
descriptions (#4327)
Mime-Version: 1.0
Content-Type: multipart/alternative;
boundary="--==_mimepart_54241e4b2f3d9_2d6f3fff4d9bd2a016244a";
charset=UTF-8
Content-Transfer-Encoding: 7bit
Precedence: list
X-GitHub-Recipient: KISS-Web-Design
List-ID: joomla/joomla-cms
List-Archive: https://github.com/joomla/joomla-cms
List-Post: reply@reply.github.com
List-Unsubscribe: unsub+i-43643858-5e9236493a56ff9b0b0e8b4fcdb6b10ddb91e411-1520943@reply.github.com,
https://github.com/notifications/unsubscribe/1520943__eyJzY29wZSI6Ik5ld3NpZXM6TXV0ZSIsImV4cGlyZXMiOjE3MjcyNzIzOTUsImRhdGEiOnsiaWQiOjQzNDMyOTQzfX0=--5c2315b4935ad02abdb3d8b743b9f5433e18ed0f
X-Auto-Response-Suppress: All
X-GitHub-Recipient-Address: chris.jones-gill@kisswebdesign.co.uk
X-enixltd-MailScanner-Information: Please contact the ISP for more information
X-enixltd-MailScanner-ID: 1XXCHv-000Z75-GX
X-enixltd-MailScanner: Found to be clean
X-enixltd-MailScanner-SpamCheck: not spam, SpamAssassin (not cached,
score=-4.84, required 5, autolearn=not spam, BAYES_00 -1.90,
HTML_IMAGE_ONLY_12 2.06, HTML_MESSAGE 0.00, RCVD_IN_DNSWL_HI -5.00,
SPF_PASS -0.00, URIBL_BLOCKED 0.00)
X-enixltd-MailScanner-From: noreply@github.com
X-Spam-Status: No

----==_mimepart_54241e4b2f3d9_2d6f3fff4d9bd2a016244a
Content-Type: text/plain;
charset=UTF-8
Content-Transfer-Encoding: 7bit

That's great. I can't test until tomorrow, but it looks good to me :-)

Thanks for getting on with this so quickly.


Reply to this email directly or view it on GitHub:
#4327 (comment)
----==_mimepart_54241e4b2f3d9_2d6f3fff4d9bd2a016244a
Content-Type: text/html;
charset=UTF-8
Content-Transfer-Encoding: 7bit

That's great. I can't test until tomorrow, but it looks good to me :-)

Thanks for getting on with this so quickly.


Reply to this email directly or view it on GitHub.

----==_mimepart_54241e4b2f3d9_2d6f3fff4d9bd2a016244a--

avatar KISS-Web-Design
KISS-Web-Design - comment - 25 Sep 2014

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its
recipients. This is a temporary error. The following address(es) deferred:

chris.jones-gill@ntlworld.com
Domain kisswebdesign.co.uk has exceeded the max emails per hour (101/100 (101%)) allowed. Message will be reattempted later

------- This is a copy of the message, including all the headers. ------
Received: from github-smtp2-ext8.iad.github.net ([192.30.252.199]:33352 helo=github-smtp2b-ext-cp1-prd.iad.github.net)
by duffman.enixns.com with esmtps (TLSv1:DHE-RSA-AES256-SHA:256)
(Exim 4.82)
(envelope-from noreply@github.com)
id 1XXCJ7-000ZXU-U7
for chris.jones-gill@kisswebdesign.co.uk; Thu, 25 Sep 2014 17:54:18 +0100
Date: Thu, 25 Sep 2014 02:47:36 -0700
From: Viktor Vogel notifications@github.com
Reply-To: joomla/joomla-cms reply@reply.github.com
To: joomla/joomla-cms joomla-cms@noreply.github.com
Message-ID:
In-Reply-To:
References:
Subject: Re: [joomla-cms] Finder: Indexer Parsers break search result
descriptions (#4327)
Mime-Version: 1.0
Content-Type: multipart/alternative;
boundary="--==_mimepart_5423e4b85688_75723fceaf26f29c7660dc";
charset=UTF-8
Content-Transfer-Encoding: 7bit
Precedence: list
X-GitHub-Recipient: KISS-Web-Design
List-ID: joomla/joomla-cms
List-Archive: https://github.com/joomla/joomla-cms
List-Post: reply@reply.github.com
List-Unsubscribe: unsub+i-43643858-5e9236493a56ff9b0b0e8b4fcdb6b10ddb91e411-1520943@reply.github.com,
https://github.com/notifications/unsubscribe/1520943__eyJzY29wZSI6Ik5ld3NpZXM6TXV0ZSIsImV4cGlyZXMiOjE3MjcyNTc2NTYsImRhdGEiOnsiaWQiOjQzNDMyOTQzfX0=--b6c19bc70c3272f74f7962a51cd28db128fe21b1
X-Auto-Response-Suppress: All
X-GitHub-Recipient-Address: chris.jones-gill@kisswebdesign.co.uk
X-enixltd-MailScanner-Information: Please contact the ISP for more information
X-enixltd-MailScanner-ID: 1XXCJ7-000ZXU-U7
X-enixltd-MailScanner: Found to be clean
X-enixltd-MailScanner-SpamCheck: not spam, SpamAssassin (not cached,
score=-3.297, required 5, autolearn=not spam, BAYES_00 -1.90,
DATE_IN_PAST_06_12 1.54, HTML_IMAGE_ONLY_12 2.06, HTML_MESSAGE 0.00,
RCVD_IN_DNSWL_HI -5.00, SPF_PASS -0.00, URIBL_BLOCKED 0.00)
X-enixltd-MailScanner-From: noreply@github.com
X-Spam-Status: No

----==_mimepart_5423e4b85688_75723fceaf26f29c7660dc
Content-Type: text/plain;
charset=UTF-8
Content-Transfer-Encoding: 7bit

@andykirk Yes, it makes totally sense. As I suggested, we should improve the check whether some HTML parts are still present and remove them to avoid such an output!

It is on my to do list. :-)


Reply to this email directly or view it on GitHub:
#4327 (comment)
----==_mimepart_5423e4b85688_75723fceaf26f29c7660dc
Content-Type: text/html;
charset=UTF-8
Content-Transfer-Encoding: 7bit

@andykirk Yes, it makes totally sense. As I suggested, we should improve the check whether some HTML parts are still present and remove them to avoid such an output!

It is on my to do list. :-)


Reply to this email directly or view it on GitHub.

----==_mimepart_5423e4b85688_75723fceaf26f29c7660dc--

avatar KISS-Web-Design
KISS-Web-Design - comment - 25 Sep 2014

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its
recipients. This is a temporary error. The following address(es) deferred:

chris.jones-gill@ntlworld.com
Domain kisswebdesign.co.uk has exceeded the max emails per hour (106/100 (106%)) allowed. Message will be reattempted later

------- This is a copy of the message, including all the headers. ------
Received: from github-smtp2-ext3.iad.github.net ([192.30.252.194]:49107 helo=github-smtp2b-ext-cp1-prd.iad.github.net)
by duffman.enixns.com with esmtps (TLSv1:DHE-RSA-AES256-SHA:256)
(Exim 4.82)
(envelope-from noreply@github.com)
id 1XXCNx-000bAS-EO
for chris.jones-gill@kisswebdesign.co.uk; Thu, 25 Sep 2014 17:59:17 +0100
Date: Thu, 25 Sep 2014 02:50:41 -0700
From: andykirk notifications@github.com
Reply-To: joomla/joomla-cms reply@reply.github.com
To: joomla/joomla-cms joomla-cms@noreply.github.com
Message-ID:
In-Reply-To:
References:
Subject: Re: [joomla-cms] Finder: Indexer Parsers break search result
descriptions (#4327)
Mime-Version: 1.0
Content-Type: multipart/alternative;
boundary="--==_mimepart_5423e5714eb5e_3eba3fa92153d2c014702dc";
charset=UTF-8
Content-Transfer-Encoding: 7bit
Precedence: list
X-GitHub-Recipient: KISS-Web-Design
List-ID: joomla/joomla-cms
List-Archive: https://github.com/joomla/joomla-cms
List-Post: reply@reply.github.com
List-Unsubscribe: unsub+i-43643858-5e9236493a56ff9b0b0e8b4fcdb6b10ddb91e411-1520943@reply.github.com,
https://github.com/notifications/unsubscribe/1520943__eyJzY29wZSI6Ik5ld3NpZXM6TXV0ZSIsImV4cGlyZXMiOjE3MjcyNTc4NDEsImRhdGEiOnsiaWQiOjQzNDMyOTQzfX0=--3c43695b1f41fe2e203f851c391a57672590a5d7
X-Auto-Response-Suppress: All
X-GitHub-Recipient-Address: chris.jones-gill@kisswebdesign.co.uk
X-enixltd-MailScanner-Information: Please contact the ISP for more information
X-enixltd-MailScanner-ID: 1XXCNx-000bAS-EO
X-enixltd-MailScanner: Found to be clean
X-enixltd-MailScanner-SpamCheck: not spam, SpamAssassin (not cached,
score=-3.297, required 5, autolearn=not spam, BAYES_00 -1.90,
DATE_IN_PAST_06_12 1.54, HTML_IMAGE_ONLY_12 2.06, HTML_MESSAGE 0.00,
RCVD_IN_DNSWL_HI -5.00, SPF_PASS -0.00, URIBL_BLOCKED 0.00)
X-enixltd-MailScanner-From: noreply@github.com
X-Spam-Status: No

----==_mimepart_5423e5714eb5e_3eba3fa92153d2c014702dc
Content-Type: text/plain;
charset=UTF-8
Content-Transfer-Encoding: 7bit

If it helps, I created a truncate HTML function ages ago.
I've just put it up as a gist:

https://gist.github.com/andykirk/b304a3c84594515677e6

I'm sure it could do with a lot of improvement but I thought I'd share in case it was useful.

Thanks,


Reply to this email directly or view it on GitHub:
#4327 (comment)
----==_mimepart_5423e5714eb5e_3eba3fa92153d2c014702dc
Content-Type: text/html;
charset=UTF-8
Content-Transfer-Encoding: 7bit

If it helps, I created a truncate HTML function ages ago.
I've just put it up as a gist:

https://gist.github.com/andykirk/b304a3c84594515677e6

I'm sure it could do with a lot of improvement but I thought I'd share in case it was useful.

Thanks,


Reply to this email directly or view it on GitHub.

----==_mimepart_5423e5714eb5e_3eba3fa92153d2c014702dc--

avatar KISS-Web-Design
KISS-Web-Design - comment - 25 Sep 2014

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its
recipients. This is a temporary error. The following address(es) deferred:

chris.jones-gill@ntlworld.com
Domain kisswebdesign.co.uk has exceeded the max emails per hour (108/100 (108%)) allowed. Message will be reattempted later

------- This is a copy of the message, including all the headers. ------
Received: from github-smtp2-ext2.iad.github.net ([192.30.252.193]:49632 helo=github-smtp2b-ext-cp1-prd.iad.github.net)
by duffman.enixns.com with esmtps (TLSv1:DHE-RSA-AES256-SHA:256)
(Exim 4.82)
(envelope-from noreply@github.com)
id 1XXCNy-000bAv-Jq
for chris.jones-gill@kisswebdesign.co.uk; Thu, 25 Sep 2014 17:59:18 +0100
Date: Thu, 25 Sep 2014 04:30:10 -0700
From: andykirk notifications@github.com
Reply-To: joomla/joomla-cms reply@reply.github.com
To: joomla/joomla-cms joomla-cms@noreply.github.com
Message-ID:
In-Reply-To:
References:
Subject: Re: [joomla-cms] Finder: Indexer Parsers break search result
descriptions (#4327)
Mime-Version: 1.0
Content-Type: multipart/alternative;
boundary="--==_mimepart_5423fcc233b7f_5da63f957334b2bc630938";
charset=UTF-8
Content-Transfer-Encoding: 7bit
Precedence: list
X-GitHub-Recipient: KISS-Web-Design
List-ID: joomla/joomla-cms
List-Archive: https://github.com/joomla/joomla-cms
List-Post: reply@reply.github.com
List-Unsubscribe: unsub+i-43643858-5e9236493a56ff9b0b0e8b4fcdb6b10ddb91e411-1520943@reply.github.com,
https://github.com/notifications/unsubscribe/1520943__eyJzY29wZSI6Ik5ld3NpZXM6TXV0ZSIsImV4cGlyZXMiOjE3MjcyNjM4MTAsImRhdGEiOnsiaWQiOjQzNDMyOTQzfX0=--b09bfdc00440da35560a53ad8d7fa543a4b0e61f
X-Auto-Response-Suppress: All
X-GitHub-Recipient-Address: chris.jones-gill@kisswebdesign.co.uk
X-enixltd-MailScanner-Information: Please contact the ISP for more information
X-enixltd-MailScanner-ID: 1XXCNy-000bAv-Jq
X-enixltd-MailScanner: Found to be clean
X-enixltd-MailScanner-SpamCheck: not spam, SpamAssassin (not cached,
score=-3.761, required 5, autolearn=not spam, BAYES_00 -1.90,
DATE_IN_PAST_03_06 1.59, HTML_IMAGE_ONLY_20 1.55, HTML_MESSAGE 0.00,
RCVD_IN_DNSWL_HI -5.00, SPF_PASS -0.00, URIBL_BLOCKED 0.00)
X-enixltd-MailScanner-From: noreply@github.com
X-Spam-Status: No

----==_mimepart_5423fcc233b7f_5da63f957334b2bc630938
Content-Type: text/plain;
charset=UTF-8
Content-Transfer-Encoding: 7bit

Yes, of course, I understand that, but parser.php makes assumptions about the character length and chunks it before passing it to parser/html.php parse().
I'm not 100% sure I fully understand how your suggested fix will work, but I've got a feeling it may still be problematic, especially if you do that before strip tags, but I could be wrong :-)

One idea could be to add a pre_process() method parser/html.php which is simply:

return strip_tags($input);

Then parser.php could check for the existence of this method before establishing the character count:

...
$return = null;
/*add this >>>*/ if (method_exists('pre_process', $this))
{
    $input = $this->pre_process($input);
}

// Parse the input in batches if bigger than 2KB.
if (strlen($input) > 2048)
{
...

As I say, just an idea. :-)


Reply to this email directly or view it on GitHub:
#4327 (comment)
----==_mimepart_5423fcc233b7f_5da63f957334b2bc630938
Content-Type: text/html;
charset=UTF-8
Content-Transfer-Encoding: 7bit

Yes, of course, I understand that, but parser.php makes assumptions about the character length and chunks it before passing it to parser/html.php parse().
I'm not 100% sure I fully understand how your suggested fix will work, but I've got a feeling it may still be problematic, especially if you do that before strip tags, but I could be wrong :-)

One idea could be to add a pre_process() method parser/html.php which is simply:

return strip_tags($input);

Then parser.php could check for the existence of this method before establishing the character count:

...
$return = null;
/*add this >>>*/ if (method_exists('pre_process', $this))
{
    $input = $this->pre_process($input);
}

// Parse the input in batches if bigger than 2KB.
if (strlen($input) > 2048)
{
...

As I say, just an idea. :-)


Reply to this email directly or view it on GitHub.

----==_mimepart_5423fcc233b7f_5da63f957334b2bc630938--

avatar andykirk
andykirk - comment - 26 Sep 2014

Hi,
Applied fix and re-index my dev site and the results are now as expected, so I can confirm the fix works.

Many thanks. :-)

avatar zero-24 zero-24 - change - 7 Jul 2015
Labels Added: ?

Add a Comment

Login with GitHub to post a comment