Search algorithm
I am interested in hearing about the self-coded PHP/MySQL site search processes others use.
- MySQL LIKE simply doesn't cut it.
- MySQL MATCH AGAINST IN BOOLEAN MODE is an improvement over LIKE, but still disappoints.
- Lucene-based crawlers/search engines require Java and special applications on the server, and seem most appropriate for huge document collections.
- PHP Sphider combines a PHP-based crawler/indexer with good search features, but the code is ancient and buggy. The code is easy to understand, so I have considered overhauling it, but that would be time-consuming and success is not assured.
Although my whole CMS database is large, each client site has no more than 500 pages to search, so iterating through all the page content and using PHP to process the results seems like a viable option. I can remove common words like "the", "to", "a" and "with" from the search string, create an array of the remaining search terms, iterate through the database content using functions like similar_text(), and rank the results by relevance.
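To make the idea concrete, here is a rough sketch of that pipeline. It is only an illustration under assumptions: the stop-word list, the function names (search_terms, score_page, rank_pages), the 80% similar_text() threshold and the half-point weight for near matches are all made up for the example, and the pages are assumed to have already been fetched into an array of page id => content.

```php
<?php
// Hypothetical stop-word list; a real one would be much longer.
const STOP_WORDS = ['the', 'to', 'a', 'with', 'and', 'of', 'in'];

// Lowercase the query, split it into words, and drop stop words and duplicates.
function search_terms(string $query): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique(array_diff($words, STOP_WORDS)));
}

// Score one page: each exact word occurrence counts 1 point; a term with no
// exact hit earns 0.5 if some word on the page is at least 80% similar
// according to similar_text(). Both numbers are arbitrary assumptions.
function score_page(array $terms, string $content): float
{
    $words  = preg_split('/[^a-z0-9]+/', strtolower(strip_tags($content)), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    $score  = 0.0;

    foreach ($terms as $term) {
        if (isset($counts[$term])) {
            $score += $counts[$term];
            continue;
        }
        $best = 0.0;
        foreach (array_keys($counts) as $word) {
            similar_text($term, (string) $word, $percent);
            $best = max($best, $percent);
        }
        if ($best >= 80.0) {
            $score += 0.5;
        }
    }
    return $score;
}

// Rank pages (id => content) by score, best first, dropping pages that score zero.
function rank_pages(string $query, array $pages): array
{
    $terms  = search_terms($query);
    $scores = [];
    foreach ($pages as $id => $content) {
        $score = score_page($terms, $content);
        if ($score > 0) {
            $scores[$id] = $score;
        }
    }
    arsort($scores);
    return $scores;
}
```

With a few hundred pages per site this brute-force loop should be tolerable, but the inner similar_text() pass is the expensive part, so it only runs for terms that have no exact match.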
Working directly on the database content instead of crawling the rendered websites has the advantage of making it easy to target the most relevant content. For instance, in my CMS the left column of a three-column layout is always of low importance because it is dropped from the mobile display. A crawler would have to be configurable to ignore parts of the page.
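Something like the following is what I have in mind for favouring the important parts of a page. Again this is just a sketch: the column names (title, main_content, left_column), the weights and the function name weighted_score are hypothetical, and it uses plain substring counting rather than anything clever.

```php
<?php
// Hypothetical column names and weights. The left column counts for little
// because it is dropped from the mobile display.
const FIELD_WEIGHTS = ['title' => 3.0, 'main_content' => 1.0, 'left_column' => 0.2];

// Sum the occurrences of each term in each field, scaled by the field's weight.
// Note that substr_count() matches substrings, so "art" also matches "party".
function weighted_score(array $row, array $terms, array $weights = FIELD_WEIGHTS): float
{
    $score = 0.0;
    foreach ($weights as $field => $weight) {
        $text = strtolower(strip_tags($row[$field] ?? ''));
        foreach ($terms as $term) {
            if ($term === '') {
                continue; // substr_count() rejects an empty needle
            }
            $score += $weight * substr_count($text, strtolower($term));
        }
    }
    return $score;
}
```

Setting a field's weight to 0, or leaving it out of the weights array, ignores that part of the page entirely, which is the thing a crawler would need special configuration to do.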
It doesn't make sense to reinvent a wheel that has already been built many times, but I've done a fair amount of research on this and have not found any available code that works, is easy to edit, and seems like it would do a good job. I have collected some ideas and snippets to use in my own algorithm if I do end up writing this from scratch.
I'm sure some of you have tackled the same project. Thoughts? Advice? Code?
