Joomla! Issue Tracker | Joomla! CMS #7927 - com_finder: FinderIndexerParserHtml: parse() too aggressive

Closed
20 May 2017
Medium
Build: master
# 7927

andykirk
22 Sep 2015

administrator/components/com_finder/helpers/indexer/parser/html.php, line 63:

$input = str_replace('>', '> ', $input)

This is too aggressive. I discovered a case where one of my articles contains the text for an expanded acronym using  tags to highlight the relevant letters. E.g. (example from Wikipedia):

the onset of congestive heart failure (CHF)

In the search output after indexing this becomes:

the onset of c ongestive h eart f ailure (CHF)

Which is broken. The comment in the code describes a problem with adjacent block tags:

// This fixes issues such as '<h1>Title</h1>Paragraph'
// being transformed into 'TitleParagraph' with no space.

With that in mind I propose a solution that only adds the space after block tags, not all tags:

$block_els = 'address|article|aside|blockquote|canvas|dd|div|dl|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|header|hgroup|hr|main|nav|noscript|ol|output|p|pre|section|table|tfoot|ul|video';
$input = preg_replace('#</(' . $block_els . ')><#', '</$1> <', $input);

Or that's too much, just add a space between adjacent tags instead of putting a space after all tags:
$input = str_replace('><', '> <', $input)

I'm happy to discuss further or raise a PR if needed.

Thanks,
Andy

andykirk - open - 22 Sep 2015

brianteeman - comment - 22 Sep 2015

At first glance it seems to make sense that it is only for block elements
On 22 Sep 2015 10:43 am, "Andy Kirk" notifications@github.com wrote:

administrator/components/com_finder/helpers/indexer/parser/html.php, line
63:

$input = str_replace('>', '> ', $input)

This is too aggressive. I discovered a case where one of my articles
contains the text for an expanded acronym using tags to highlight the
relevant letters. E.g. (example from Wikipedia
https://en.wikipedia.org/wiki/Acronym):

the onset of congestive heart failure (CHF)

In the search output after indexing this becomes:

the onset of c ongestive h eart f ailure (CHF)

Which is broken. The comment in the code describes a problem with adjacent
block tags:

// This fixes issues such as '

Title

Paragraph
'
// being transformed into 'TitleParagraph' with no space.

With that in mind I propose a solution that only adds the space after
block tags, not all tags:

$block_els = 'address|article|aside|blockquote|canvas|dd|div|dl|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|header|hgroup|hr|main|nav|noscript|ol|output|p|pre|section|table|tfoot|ul|video';
$input = preg_replace('#</(' . $block_els . ')><#', '</$1> <', $input);

Or that's too much, just add a space between adjacent tags instead of
putting a space after all tags:
$input = str_replace('><', '> <', $input)

I'm happy to discuss further or raise a PR if needed.

Thanks,
Andy

—
Reply to this email directly or view it on GitHub
#7927.

zero-24 - change - 23 Sep 2015

Labels

Added: ?

zero-24 - change - 23 Sep 2015

Category

Administration

⇒

Administration Search

chrisdavenport - comment - 8 May 2016

Hi @andykirk you created this issue sometime ago but have not provided a PR with test instructions for people to evaluate so I'm closing the issue at this time. If code is provided it can always be re-examined. Thanks for raising the issue for consideration.

_{This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/7927.}

chrisdavenport - change - 8 May 2016

Status	New	⇒	Closed
Closed_Date	0000-00-00 00:00:00	⇒	2016-05-08 10:10:41
Closed_By		⇒	chrisdavenport

joomla-cms-bot - change - 8 May 2016

Closed_Date	2016-05-08 10:10:41	⇒	2016-05-08 10:10:42
Closed_By	chrisdavenport	⇒	joomla-cms-bot

joomla-cms-bot - close - 8 May 2016

chrisdavenport - comment - 8 May 2016

Set to "closed" on behalf of @chrisdavenport by The JTracker Application at issues.joomla.org/joomla-cms/7927

joomla-cms-bot - close - 8 May 2016

andykirk - comment - 13 May 2016

Hi,

I didn't feel qualified to make a PR for this - I was hoping for more discussion or input from others first.

The issue still persists though.

Steps to reproduce:

Create an article with this HTML content `congestive heart failure (CHF).
Run a search index in Smart Search
Search for "congestive heart failure" - you'll get no results
Search for "CHF" and you'll get the result for that article, but it'll show the text as "c ongestive h eart f ailure (CHF")

Fix:

Change administrator/components/com_finder/helpers/indexer/parser/html.php, line 63 from $input = str_replace('>', '> ', $input); to $input = str_replace('><', '> <', $input);

Verify:

Clear search index then re-run search index.
Search for "congestive heart failure" - you should now get the expected result.

Does that help?

I'm having to hack the core at the moment,

I can raise a PR for this line if it's acceptable.

Thanks.

brianteeman - comment - 13 May 2016

Just to be clear your text has the first letter in bold and that is what is
causing the problem?

If that's correct then please submit the pr with the code you suggest and
it can be quickly tested.

andykirk - comment - 13 May 2016

Hi,

It's not that specific - it'll happen for any cases where any tag appears anywhere in a word to differentiate one or more letters that are a part of that word. Currently they'll be split into separate words.

Another example:

Glibenclamide and metfoRmin versus stAndard Care in gEstational diabeteS
(Glibenclamide and metfoRmin versus stAndard Care in gEstational diabeteS)

Appears in the search results as:
G libenclamide and metfo R min versus st A ndard C are in g E stational diabete S

The change should fix this example too but I'm unable to test properly at the moment.

Should I still raise the PR?

andykirk - comment - 13 May 2016

I can confirm the change fixes the 2nd example. However, it's still not completely robust as it won't fix double-tags - e.g.:

congestive heart failure

the presence of the >< above causes spaces to remain.

These things are edge-cases, but I feel there should be a robust solution to this, since it can cause words that may be very important to be unexpectedly unindexable.

chrisdavenport - comment - 13 May 2016

Before submitting a PR I think it would be useful to write down a set of test cases with the expected results. I can see that there could be edge cases which will be difficult to fix without introducing regressions, so we need some good documented tests (which can be manual tests) to keep the lid on that. For example,

<h1>Title</h1>Paragraph should produce Title Paragraph
metfoRmin should produce metfoRmin
congestive should produce congestive

Can you think of any others?

mbabker - comment - 13 May 2016

https://github.com/joomla/joomla-cms/blob/staging/tests/unit/suites/finderIndexer/parser/FinderIndexerParserHtmlTest.php

I did at one point crazily build some unit tests for the FinderIndexer library. So add the test cases there and you should be able to not lose them.

brianteeman - change - 13 May 2016

Status	Closed	⇒	New
Closed_Date	2016-05-08 10:10:41	⇒
Closed_By	joomla-cms-bot	⇒

brianteeman - reopen - 13 May 2016

chrisdavenport - comment - 13 May 2016

Oh yeah, I'd forgotten about those tests. Thanks for the reminder.

andykirk - comment - 16 May 2016

@chrisdavenport - I can't think of any other variations exactly - but due to the nature of HTML, it's technically possible there could be any number of tags wrapping the letter(s), so the solution shouldn't be limited to a maximum of two.

E.g.:
<s>LETTERS</s>

Ideally this sort of thing shouldn't happen, of course, but it's possible.

I'm wondering about using strip_tags() twice and using the 2nd argument on the first occurrence to allow all block tags through before running the block spacing fix.

So perhaps something like:

// Strips all inline tags:
$input = strip_tags($input, '<address><article><aside><blockquote><canvas><dd><div><dl><fieldset><figcaption><figure><footer><form><h1><h2><h3><h4><h5><h6><header><hgroup><hr><main><nav><noscript><ol><output><p><pre><section><table><tfoot><ul><video>');

// This fixes issues such as '<h1>Title</h1><p>Paragraph</p>'
// being transformed into 'TitleParagraph' with no space.
$input = str_replace('><', '> <', $input);

// Strip all block tags:
$input = strip_tags($input);

I realise this is similar to my regex solution I first posted, and I don't know if it's quicker or not, but I can't think of any other reliable way at the moment.

joomla-cms-bot - change - 26 May 2016

The description was changed

joomla-cms-bot - edited - 26 May 2016

franz-wohlkoenig - change - 4 Apr 2017

Status

New

⇒

Discussion

brianteeman - comment - 20 May 2017

I created a PR for testing #16159

brianteeman - change - 20 May 2017

Status	Discussion	⇒	Closed
Closed_Date	0000-00-00 00:00:00	⇒	2017-05-20 20:40:36
Closed_By		⇒	brianteeman

brianteeman - close - 20 May 2017

andykirk - comment - 20 May 2017

Thanks.
I'll test as soon as I can, though your solution is the same as one I originally proposed (and tested).
It worked for single-tag cases, but not multiple tag cases.
It's better, but not perfect...

brianteeman - comment - 20 May 2017

@andykirk I just want to move it on from sitting here for 12 months with no action to actually getting solved

andykirk - comment - 20 May 2017

Understood (and appreciated), and I'd certainly rather have this in the core than nothing.

I just wanted to point out that it doesn't cover all cases, and the only solution I can think of was what I proposed most recently (to strip all inline tags in the first instance).

I understand that that wasn't ideal either, though...

As I say, I'll just double-check your PR as soon as I can though :-)
Thanks again.

Add a Comment

Older
Newer

Joomla! Issue Tracker - CMS

[#7927] - com_finder: FinderIndexerParserHtml: parse() too aggressive

Title

Add a Comment