administrator/components/com_finder/helpers/indexer/parser/html.php, line 63:
$input = str_replace('>', '> ', $input)
This is too aggressive. I discovered a case where one of my articles contains the text for an expanded acronym using <b>
tags to highlight the relevant letters. E.g. (example from Wikipedia):
the onset of congestive heart failure (CHF)
In the search output after indexing this becomes:
the onset of c ongestive h eart f ailure (CHF)
Which is broken. The comment in the code describes a problem with adjacent block tags:
// This fixes issues such as '<h1>Title</h1><p>Paragraph</p>'
// being transformed into 'TitleParagraph' with no space.
With that in mind I propose a solution that only adds the space after block tags, not all tags:
$block_els = 'address|article|aside|blockquote|canvas|dd|div|dl|fieldset|figcaption|figure|footer|form|h1|h2|h3|h4|h5|h6|header|hgroup|hr|main|nav|noscript|ol|output|p|pre|section|table|tfoot|ul|video';
$input = preg_replace('#</(' . $block_els . ')><#', '</$1> <', $input);
Or that's too much, just add a space between adjacent tags instead of putting a space after all tags:
$input = str_replace('><', '> <', $input)
I'm happy to discuss further or raise a PR if needed.
Thanks,
Andy
Labels |
Added:
?
|
Category | ⇒ | Administration |
Using preg_replace should be avoided because regular expressions are very slow. Your str_replace idea looks reasonable.
Category | Administration | ⇒ | Administration Search |
Hi @andykirk you created this issue sometime ago but have not provided a PR with test instructions for people to evaluate so I'm closing the issue at this time. If code is provided it can always be re-examined. Thanks for raising the issue for consideration.
Status | New | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2016-05-08 10:10:41 |
Closed_By | ⇒ | chrisdavenport |
Closed_Date | 2016-05-08 10:10:41 | ⇒ | 2016-05-08 10:10:42 |
Closed_By | chrisdavenport | ⇒ | joomla-cms-bot |
Set to "closed" on behalf of @chrisdavenport by The JTracker Application at issues.joomla.org/joomla-cms/7927
Hi,
I didn't feel qualified to make a PR for this - I was hoping for more discussion or input from others first.
The issue still persists though.
Steps to reproduce:
Fix:
Change administrator/components/com_finder/helpers/indexer/parser/html.php, line 63 from $input = str_replace('>', '> ', $input);
to $input = str_replace('><', '> <', $input);
Verify:
Does that help?
I'm having to hack the core at the moment,
I can raise a PR for this line if it's acceptable.
Thanks.
Just to be clear your text has the first letter in bold and that is what is
causing the problem?
If that's correct then please submit the pr with the code you suggest and
it can be quickly tested.
Hi,
It's not that specific - it'll happen for any cases where any tag appears anywhere in a word to differentiate one or more letters that are a part of that word. Currently they'll be split into separate words.
Another example:
Glibenclamide and metfoRmin versus stAndard Care in gEstational diabeteS
(<i>G</i>libenclamide and metfo<i>R</i>min versus st<i>A</i>ndard <i>C</i>are in g<i>E</i>stational diabete<i>S</i>
)
Appears in the search results as:
G libenclamide and metfo R min versus st A ndard C are in g E stational diabete S
The change should fix this example too but I'm unable to test properly at the moment.
Should I still raise the PR?
I can confirm the change fixes the 2nd example. However, it's still not completely robust as it won't fix double-tags - e.g.:
<b><i>c</i></b>ongestive <b><i>h</i></b>eart <b><i>f</i></b>ailure
the presence of the ><
above causes spaces to remain.
These things are edge-cases, but I feel there should be a robust solution to this, since it can cause words that may be very important to be unexpectedly unindexable.
Before submitting a PR I think it would be useful to write down a set of test cases with the expected results. I can see that there could be edge cases which will be difficult to fix without introducing regressions, so we need some good documented tests (which can be manual tests) to keep the lid on that. For example,
<h1>Title</h1><p>Paragraph</p>
should produce Title Paragraph
metfo<i>R</i>min
should produce metfoRmin
<b><i>c</i></b>ongestive
should produce congestive
Can you think of any others?
I did at one point crazily build some unit tests for the FinderIndexer library. So add the test cases there and you should be able to not lose them.
Status | Closed | ⇒ | New |
Closed_Date | 2016-05-08 10:10:41 | ⇒ | |
Closed_By | joomla-cms-bot | ⇒ |
Oh yeah, I'd forgotten about those tests. Thanks for the reminder.
@chrisdavenport - I can't think of any other variations exactly - but due to the nature of HTML, it's technically possible there could be any number of tags wrapping the letter(s), so the solution shouldn't be limited to a maximum of two.
E.g.:
<em><strong><s><u>LETTERS</u></s></strong></em>
Ideally this sort of thing shouldn't happen, of course, but it's possible.
I'm wondering about using strip_tags()
twice and using the 2nd argument on the first occurrence to allow all block tags through before running the block spacing fix.
So perhaps something like:
// Strips all inline tags:
$input = strip_tags($input, '<address><article><aside><blockquote><canvas><dd><div><dl><fieldset><figcaption><figure><footer><form><h1><h2><h3><h4><h5><h6><header><hgroup><hr><main><nav><noscript><ol><output><p><pre><section><table><tfoot><ul><video>');
// This fixes issues such as '<h1>Title</h1><p>Paragraph</p>'
// being transformed into 'TitleParagraph' with no space.
$input = str_replace('><', '> <', $input);
// Strip all block tags:
$input = strip_tags($input);
I realise this is similar to my regex solution I first posted, and I don't know if it's quicker or not, but I can't think of any other reliable way at the moment.
Status | New | ⇒ | Discussion |
Status | Discussion | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2017-05-20 20:40:36 |
Closed_By | ⇒ | brianteeman |
Thanks.
I'll test as soon as I can, though your solution is the same as one I originally proposed (and tested).
It worked for single-tag cases, but not multiple tag cases.
It's better, but not perfect...
Understood (and appreciated), and I'd certainly rather have this in the core than nothing.
I just wanted to point out that it doesn't cover all cases, and the only solution I can think of was what I proposed most recently (to strip all inline tags in the first instance).
I understand that that wasn't ideal either, though...
As I say, I'll just double-check your PR as soon as I can though :-)
Thanks again.
At first glance it seems to make sense that it is only for block elements
On 22 Sep 2015 10:43 am, "Andy Kirk" notifications@github.com wrote: