Smart Search was donated to the project by a commercial company at some point, but I'm fairly sure that company wasn't the original developer of the system. Parts of the system feel like the work of at least two completely different developers, which has left inconsistencies, some of which haven't been resolved even today. One of them is the support for different parsers for the content to index.
The indexer class supports a method parameter to select different parsers for the content, so that you could, for example, parse plain text, HTML, RTF documents or basically anything else you can think of. However, this parameter applies to all properties of a Result object, which is a problem when you have, for example, HTML content in a description and a PDF (or RTF) in another property. (Think of a document manager.)
This PR implements a new parameter to the Result::addInstruction() method to select the parser used to read a property. Right now this parameter supports txt, html and rtf, but additional parsers, for example for PDF or docx, are possible. (For docx in particular, it should be considered whether this has to be part of Joomla core. I would be happy with just PDF for now.)
This PR also fixes an issue where the memory_table_limit seems to have been reverted to a wrong value during an upmerge, and it raises the chunk size for reading data from 2KiB to 32KiB. I would question whether 2KiB was even the right value in 2012 when this was added to Joomla, and going to 32KiB today is still playing this VERY safe. Cutting the data into such small chunks also means that all the rest of the code runs more often than necessary, which reduces performance.
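To put rough numbers on the chunk size change, here is a minimal arithmetic sketch. It assumes the per-chunk work runs once for every chunk read. The chunkCount() helper and the 3 MiB file size are hypothetical and are not taken from the indexer code.

```php
<?php
// Hypothetical helper: how many read iterations are needed to consume
// $bytes of input when reading $chunkSize bytes at a time (ceiling division).
function chunkCount(int $bytes, int $chunkSize): int
{
    return intdiv($bytes + $chunkSize - 1, $chunkSize);
}

// Assumed example input: a 3 MiB document.
$size = 3 * 1024 * 1024;

echo chunkCount($size, 2 * 1024), "\n";  // 1536 iterations with 2KiB chunks
echo chunkCount($size, 32 * 1024), "\n"; // 96 iterations with 32KiB chunks
```

Under that assumption, the larger chunk size cuts the number of passes through the per-chunk code by a factor of 16.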
The code is backwards compatible: when the index() method is called with a $format parameter, that parameter takes precedence over the set instructions, on the assumption that this is legacy code which is unaware of the new feature.
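The precedence rule can be illustrated with a small, self-contained sketch. Everything in it is a hypothetical stand-in: SketchResult is not Joomla's Result class, its addInstruction() signature is simplified and assumed, and resolveParsers() only mimics the decision described above rather than the actual indexer code.

```php
<?php
// Hypothetical stand-in for a Smart Search result item; NOT Joomla's Result class.
final class SketchResult
{
    /** @var array<string, string> property name => parser format */
    private array $parsers = [];

    /** @var array<string, string> property name => raw content */
    public array $properties = [];

    // Assumed, simplified signature: the parser argument's name and position are illustrative.
    public function addInstruction(string $property, string $parser = 'html'): void
    {
        $this->parsers[$property] = $parser;
    }

    // Parser registered for a property, defaulting to 'html' in this sketch.
    public function parserFor(string $property): string
    {
        return $this->parsers[$property] ?? 'html';
    }
}

// Resolve the parser for each property. A non-null $format mimics a legacy
// index() call and overrides every per-property instruction.
function resolveParsers(SketchResult $item, ?string $format = null): array
{
    $resolved = [];
    foreach (array_keys($item->properties) as $property) {
        $resolved[$property] = $format ?? $item->parserFor($property);
    }
    return $resolved;
}

$item = new SketchResult();
$item->properties = ['description' => '<p>Hello</p>', 'body' => '{\rtf1 Hello}'];
$item->addInstruction('description', 'html');
$item->addInstruction('body', 'rtf');

print_r(resolveParsers($item));        // description => html, body => rtf
print_r(resolveParsers($item, 'txt')); // legacy call: txt for every property
```

Treating an explicitly passed format as an override keeps older callers working unchanged, at the cost of ignoring per-property instructions whenever both are present.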
Please find attached a testing plugin for Smart Search, which adds one entry to the index and reads an RTF file into the system while doing so. You need to provide your own RTF sample file. Extract the attached ZIP to your /plugins/finder folder and discover the plugin in the backend. Make sure that you have enabled the plugin. Edit /plugins/finder/test/src/Extension/Test.php and, in line 141, add the demo RTF file you want to index. The file path is resolved from the root of the site. Then click Index in the Smart Search backend. Afterwards you can search for the content of the RTF in the frontend and should get an entry named Test RTF when it matches.
test.zip
Documentation will be added soon.
Documentation link for docs.joomla.org:
No documentation changes for docs.joomla.org needed
Pull Request link for manual.joomla.org:
No documentation changes for manual.joomla.org needed
Status | New | ⇒ | Pending |
Category | ⇒ | Administration com_finder |
Not exactly. That site does contain RTF files, but they are all just lorem ipsum text. I looked for some public domain books to parse here, came across (obscure versions of) the Bible and finally settled on War and Peace by Tolstoy. I didn't list a source for RTF files because I didn't want several people to test this with just one specific file.
Struggling to see why this should be added to the core
Because it has been part of core since 2.5.0; it has just been broken the whole time.
Surely that's an indicator that it should be removed, if anything at all.
This pull request has been automatically rebased to 5.3-dev.
A good source for a sample RTF file is https://file-examples.com/index.php/sample-documents-download/sample-rtf-download/