?
avatar andykirk
andykirk
22 Sep 2015

administrator/components/com_finder/helpers/indexer/parser/html.php, line 63:

$input = str_replace('>', '> ', $input)

This is too aggressive. I discovered a case where one of my articles contains the text for an expanded acronym using <b> tags to highlight the relevant letters. E.g. (example from Wikipedia):

the onset of congestive heart failure (CHF)

In the search output after indexing this becomes:

the onset of c ongestive h eart f ailure (CHF)

Which is broken. The comment in the code describes a problem with adjacent block tags:

// This fixes issues such as '<h1>Title</h1><p>Paragraph</p>'
// being transformed into 'TitleParagraph' with no space.

With that in mind I propose a solution that only adds the space after block tags, not all tags:

$block_els = 'address|article|aside|blockquote|canvas|dd|div|dl|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|header|hgroup|hr|main|nav|noscript|ol|output|p|pre|section|table|tfoot|ul|video';
$input = preg_replace('#</(' . $block_els . ')><#', '</$1> <', $input);

Or that's too much, just add a space between adjacent tags instead of putting a space after all tags:
$input = str_replace('><', '> <', $input)

I'm happy to discuss further or raise a PR if needed.

Thanks,
Andy

avatar andykirk andykirk - open - 22 Sep 2015
avatar brianteeman
brianteeman - comment - 22 Sep 2015

At first glance it seems to make sense that it is only for block elements
On 22 Sep 2015 10:43 am, "Andy Kirk" notifications@github.com wrote:

administrator/components/com_finder/helpers/indexer/parser/html.php, line
63:

$input = str_replace('>', '> ', $input)

This is too aggressive. I discovered a case where one of my articles
contains the text for an expanded acronym using tags to highlight the
relevant letters. E.g. (example from Wikipedia
https://en.wikipedia.org/wiki/Acronym):

the onset of congestive heart failure (CHF)

In the search output after indexing this becomes:

the onset of c ongestive h eart f ailure (CHF)

Which is broken. The comment in the code describes a problem with adjacent
block tags:

// This fixes issues such as '

Title

Paragraph

'
// being transformed into 'TitleParagraph' with no space.

With that in mind I propose a solution that only adds the space after
block tags, not all tags:

$block_els = 'address|article|aside|blockquote|canvas|dd|div|dl|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|header|hgroup|hr|main|nav|noscript|ol|output|p|pre|section|table|tfoot|ul|video';
$input = preg_replace('#</(' . $block_els . ')><#', '</$1> <', $input);

Or that's too much, just add a space between adjacent tags instead of
putting a space after all tags:
$input = str_replace('><', '> <', $input)

I'm happy to discuss further or raise a PR if needed.

Thanks,
Andy


Reply to this email directly or view it on GitHub
#7927.

avatar zero-24 zero-24 - change - 23 Sep 2015
Labels Added: ?
avatar zero-24 zero-24 - change - 23 Sep 2015
Category Administration
avatar chrisdavenport
chrisdavenport - comment - 24 Sep 2015

Using preg_replace should be avoided because regular expressions are very slow. Your str_replace idea looks reasonable.

avatar brianteeman brianteeman - change - 28 Apr 2016
Category Administration Administration Search
avatar chrisdavenport
chrisdavenport - comment - 8 May 2016

Hi @andykirk you created this issue sometime ago but have not provided a PR with test instructions for people to evaluate so I'm closing the issue at this time. If code is provided it can always be re-examined. Thanks for raising the issue for consideration.


This comment was created with the J!Tracker Application at issues.joomla.org/joomla-cms/7927.

avatar chrisdavenport chrisdavenport - change - 8 May 2016
Status New Closed
Closed_Date 0000-00-00 00:00:00 2016-05-08 10:10:41
Closed_By chrisdavenport
avatar joomla-cms-bot joomla-cms-bot - change - 8 May 2016
Closed_Date 2016-05-08 10:10:41 2016-05-08 10:10:42
Closed_By chrisdavenport joomla-cms-bot
avatar joomla-cms-bot joomla-cms-bot - close - 8 May 2016
avatar chrisdavenport
chrisdavenport - comment - 8 May 2016
avatar joomla-cms-bot joomla-cms-bot - close - 8 May 2016
avatar andykirk
andykirk - comment - 13 May 2016

Hi,

I didn't feel qualified to make a PR for this - I was hoping for more discussion or input from others first.

The issue still persists though.

Steps to reproduce:

  1. Create an article with this HTML content `congestive heart failure (CHF).
  2. Run a search index in Smart Search
  3. Search for "congestive heart failure" - you'll get no results
  4. Search for "CHF" and you'll get the result for that article, but it'll show the text as "c ongestive h eart f ailure (CHF")

Fix:

Change administrator/components/com_finder/helpers/indexer/parser/html.php, line 63 from $input = str_replace('>', '> ', $input); to $input = str_replace('><', '> <', $input);

Verify:

  1. Clear search index then re-run search index.
  2. Search for "congestive heart failure" - you should now get the expected result.

Does that help?

I'm having to hack the core at the moment,

I can raise a PR for this line if it's acceptable.

Thanks.

avatar brianteeman
brianteeman - comment - 13 May 2016

Just to be clear your text has the first letter in bold and that is what is
causing the problem?

If that's correct then please submit the pr with the code you suggest and
it can be quickly tested.

avatar andykirk
andykirk - comment - 13 May 2016

Hi,

It's not that specific - it'll happen for any cases where any tag appears anywhere in a word to differentiate one or more letters that are a part of that word. Currently they'll be split into separate words.

Another example:

Glibenclamide and metfoRmin versus stAndard Care in gEstational diabeteS
(<i>G</i>libenclamide and metfo<i>R</i>min versus st<i>A</i>ndard <i>C</i>are in g<i>E</i>stational diabete<i>S</i>)

Appears in the search results as:
G libenclamide and metfo R min versus st A ndard C are in g E stational diabete S

The change should fix this example too but I'm unable to test properly at the moment.

Should I still raise the PR?

avatar andykirk
andykirk - comment - 13 May 2016

I can confirm the change fixes the 2nd example. However, it's still not completely robust as it won't fix double-tags - e.g.:

<b><i>c</i></b>ongestive <b><i>h</i></b>eart <b><i>f</i></b>ailure

the presence of the >< above causes spaces to remain.

These things are edge-cases, but I feel there should be a robust solution to this, since it can cause words that may be very important to be unexpectedly unindexable.

avatar chrisdavenport
chrisdavenport - comment - 13 May 2016

Before submitting a PR I think it would be useful to write down a set of test cases with the expected results. I can see that there could be edge cases which will be difficult to fix without introducing regressions, so we need some good documented tests (which can be manual tests) to keep the lid on that. For example,

<h1>Title</h1><p>Paragraph</p> should produce Title Paragraph
metfo<i>R</i>min should produce metfoRmin
<b><i>c</i></b>ongestive should produce congestive

Can you think of any others?

avatar mbabker
mbabker - comment - 13 May 2016

https://github.com/joomla/joomla-cms/blob/staging/tests/unit/suites/finderIndexer/parser/FinderIndexerParserHtmlTest.php

I did at one point crazily build some unit tests for the FinderIndexer library. So add the test cases there and you should be able to not lose them.

avatar brianteeman brianteeman - change - 13 May 2016
Status Closed New
Closed_Date 2016-05-08 10:10:41
Closed_By joomla-cms-bot
avatar brianteeman brianteeman - reopen - 13 May 2016
avatar brianteeman brianteeman - reopen - 13 May 2016
avatar chrisdavenport
chrisdavenport - comment - 13 May 2016

Oh yeah, I'd forgotten about those tests. Thanks for the reminder.

avatar andykirk
andykirk - comment - 16 May 2016

@chrisdavenport - I can't think of any other variations exactly - but due to the nature of HTML, it's technically possible there could be any number of tags wrapping the letter(s), so the solution shouldn't be limited to a maximum of two.

E.g.:
<em><strong><s><u>LETTERS</u></s></strong></em>

Ideally this sort of thing shouldn't happen, of course, but it's possible.

I'm wondering about using strip_tags() twice and using the 2nd argument on the first occurrence to allow all block tags through before running the block spacing fix.

So perhaps something like:

// Strips all inline tags:
$input = strip_tags($input, '<address><article><aside><blockquote><canvas><dd><div><dl><fieldset><figcaption><figure><footer><form><h1><h2><h3><h4><h5><h6><header><hgroup><hr><main><nav><noscript><ol><output><p><pre><section><table><tfoot><ul><video>');

// This fixes issues such as '<h1>Title</h1><p>Paragraph</p>'
// being transformed into 'TitleParagraph' with no space.
$input = str_replace('><', '> <', $input);

// Strip all block tags:
$input = strip_tags($input);

I realise this is similar to my regex solution I first posted, and I don't know if it's quicker or not, but I can't think of any other reliable way at the moment.

avatar joomla-cms-bot joomla-cms-bot - change - 26 May 2016
The description was changed
avatar joomla-cms-bot joomla-cms-bot - edited - 26 May 2016
avatar franz-wohlkoenig franz-wohlkoenig - change - 4 Apr 2017
Status New Discussion
avatar brianteeman
brianteeman - comment - 20 May 2017

I created a PR for testing #16159

avatar brianteeman brianteeman - change - 20 May 2017
Status Discussion Closed
Closed_Date 0000-00-00 00:00:00 2017-05-20 20:40:36
Closed_By brianteeman
avatar brianteeman brianteeman - close - 20 May 2017
avatar andykirk
andykirk - comment - 20 May 2017

Thanks.
I'll test as soon as I can, though your solution is the same as one I originally proposed (and tested).
It worked for single-tag cases, but not multiple tag cases.
It's better, but not perfect...

avatar brianteeman
brianteeman - comment - 20 May 2017

@andykirk I just want to move it on from sitting here for 12 months with no action to actually getting solved

avatar andykirk
andykirk - comment - 20 May 2017

Understood (and appreciated), and I'd certainly rather have this in the core than nothing.

I just wanted to point out that it doesn't cover all cases, and the only solution I can think of was what I proposed most recently (to strip all inline tags in the first instance).

I understand that that wasn't ideal either, though...

As I say, I'll just double-check your PR as soon as I can though :-)
Thanks again.

Add a Comment

Login with GitHub to post a comment