MariaDB [lsv5]> SELECT term_id,term,stem FROM ####_finder_terms WHERE term LIKE '14f%' limit 10;
+----------+------------------------------------------------+------------------------------------------------+
| term_id | term | stem |
+----------+------------------------------------------------+------------------------------------------------+
| 33915366 | 14f003e9ab3d7f563dd6ae0898c7a885 | 14f003e9ab3d7f563dd6ae0898c7a885 |
| 33915367 | 14f003e9ab3d7f563dd6ae0898c7a885 http | 14f003e9ab3d7f563dd6ae0898c7a885 http |
| 33915368 | 14f003e9ab3d7f563dd6ae0898c7a885 http security | 14f003e9ab3d7f563dd6ae0898c7a885 http security |
| 28745235 | 14f0667d12bbb5a2c7c56faac7cbac20 | 14f0667d12bbb5a2c7c56faac7cbac20 |
| 28745236 | 14f0667d12bbb5a2c7c56faac7cbac20 http | 14f0667d12bbb5a2c7c56faac7cbac20 http |
| 28745237 | 14f0667d12bbb5a2c7c56faac7cbac20 http security | 14f0667d12bbb5a2c7c56faac7cbac20 http security |
| 32257289 | 14f079bab0462128f882134fb57f8fca | 14f079bab0462128f882134fb57f8fca |
| 32257290 | 14f079bab0462128f882134fb57f8fca http | 14f079bab0462128f882134fb57f8fca http |
| 32257291 | 14f079bab0462128f882134fb57f8fca http security | 14f079bab0462128f882134fb57f8fca http security |
| 28669590 | 14f0ab9a127f8b4db1650f9c6b29536f | 14f0ab9a127f8b4db1650f9c6b29536f |
+----------+------------------------------------------------+------------------------------------------------+
10 rows in set (0.001 sec)
The expect result is without the 14f* text prefacing the actual search terms.
There are about 55k lines in the finder_terms table that begin with 14. Where does this come from? Even if we remove the finder_terms tables and recreate them the corrupt data seems to be repopulated.
Fedora 31 x86_64
php-7.3.13-1.fc31.x86_64
mariadb-10.3.21-3.fc31.x86_64
joomla-3.9.16
The site is https://linuxsecurity.com/
This comes from some content that you have on your site that is being indexed. At one place, you will have that hash value (as it seems) in your data.
@dwreski Could you check if you have articles where the introtext or the fulltext contain such patterns?
You could use following SQL statement to check for one of these patterns (replace "#__" by your actial table prefix):
SELECT `id`, `title` FROM `#__content` WHERE `introtext` like '%14f003e9ab3d7f563dd6ae0898c7a885%' OR `fulltext` like '%14f003e9ab3d7f563dd6ae0898c7a885%';
You should use the complete term with '%' at beginning and end, in the example '%14f003e9ab3d7f563dd6ae0898c7a885%', or at least a long enough part of it to exclude unwanted results. You could try '%14f0%', too, of course, but I worry you might get unwanted results in case if you have '14f0' in your texts for some other reason.
Please check and let us know if you find some result where those patterns are included in the intro text or the full text.
If you don't find such: Do you have some inline css or js inside your articles?
Yes, that was it. I don't know how our developers never searched the content table itself for this. These strings are MD5 sums that correspond with file downloads.
Is it possible to exclude strings of a certain type or length from being indexed?
Perhaps someone knows where in the component the word is inserted into the database so we can create an exclusion there?
@dwreski I don't think it makes sense to hard-code an exclusion there in the CMS core which might be suitable for you but not for others. And if you do that yourself locally on your site, it will be overwritten with every update of the CMS (which you hopefully regularly do in order to keep it up to date and secure).
So it would need an adjustable option for that, and this would be a new feature, and such go into J4.
Feel free to open a feature request issue for that.
For this issue here it only needs @Hackwar to confirm that there is no way in backend configuration of the finder to exclude such terms from search. Then this issue should be closed because it is no issue, the finder does what it shall do.
What if we were to delete the terms from the finder tables after they were inserted, after each time the finder tables are updated by just iterating through them with the appropriate DELETE statement?
That would be silly.
I don't think you understand - I just deleted 2.2M MD5sum entries that otherwise serve no purpose. The finder tables update once per day, and take 7 seconds to delete 2.2M entries from a 7GB terms table. The search is again much more responsive, and works as an interim solution.
I understand pretty well. You can delete them in your database e.g. in PhpMyAdmin. But it would be silly to let the finder remove them after each run of the indexer. The finder would be busy all the time with nothing else.
Yes, no doubt that is silly. I'm doing it from the command-line with 'mysql -e'. Please help me to identify the section in the finder_indexer script where it does the insert. I don't have the time/resources to do that.
Status | New | ⇒ | Closed |
Closed_Date | 0000-00-00 00:00:00 | ⇒ | 2020-03-21 23:45:16 |
Closed_By | ⇒ | brianteeman |
Can you not use the commons_terms table for the exclusions?
Anyway I am closing this as the issue is not due to any bug in joomla or a corrupt table.
@Hackwar Any idea where these finder terms could come from?