Search algorithm

Question

I am interested in hearing about the self-coded PHP/MySQL site search processes others use.

MySQL LIKE simply doesn't cut it.
MySQL MATCH AGAINST IN BOOLEAN MODE is an improvement over LIKE, but still disappoints.
Lucene-based crawlers/search engines require Java and special applications on the server, and seem most appropriate for huge document collections.
PHP Sphider combines a PHP-based crawler/indexer with good search features, but the code is ancient and buggy. The code is easy to understand, so I have considered overhauling it, but that would be time-consuming and success is not assured.

Although my whole CMS database is large, each client site has no more than 500 pages to search, so iterating through all the page content and using PHP to process the results seems like a viable option. I can then remove common words like "the", "to", "a" and "with" from the search string, create an array of the search terms, iterate through the database content using functions like similar_text(), rank the results by relevance, etc.
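A minimal sketch of that approach, assuming the page content has already been pulled from the database into an array keyed by page ID. The stop-word list, the function names (search_terms, score_page, rank_pages), the 80% similarity threshold and the 0.5 near-match bonus are all illustrative assumptions, not tuned values:

```php
<?php
// Hypothetical stop-word list; extend it for your own content.
const STOP_WORDS = ['the', 'to', 'a', 'with', 'in', 'of', 'and'];

/** Lower-case the query, split it into words, and drop stop words and duplicates. */
function search_terms(string $query): array
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($query), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return array_values(array_unique(array_diff($words, STOP_WORDS)));
}

/** Score one page: exact hits count once per occurrence, a near match adds a small bonus. */
function score_page(array $terms, string $content): float
{
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower(strip_tags($content)), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $counts = array_count_values($words);
    $score = 0.0;
    foreach ($terms as $term) {
        if (isset($counts[$term])) {
            $score += $counts[$term];
            continue;
        }
        // No exact hit: look for the closest word on the page with similar_text().
        $best = 0.0;
        foreach (array_keys($counts) as $word) {
            similar_text($term, (string) $word, $percent);
            $best = max($best, $percent);
        }
        if ($best >= 80.0) {   // assumed threshold for "close enough"
            $score += 0.5;     // assumed bonus, deliberately below an exact hit
        }
    }
    return $score;
}

/** Return [pageId => score] for every page that matched, best first. */
function rank_pages(array $pages, string $query): array
{
    $terms = search_terms($query);
    $scores = [];
    foreach ($pages as $pageId => $content) {
        $score = score_page($terms, $content);
        if ($score > 0) {
            $scores[$pageId] = $score;
        }
    }
    arsort($scores);
    return $scores;
}
```

The near-match loop compares each unmatched term against every distinct word on the page, which is acceptable for a few hundred pages but is the first thing to reconsider if the content grows.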

Working directly on the database content instead of crawling websites has the advantage of making it easy to target the most relevant content. For instance, in my CMS's three-column layout, the left column content is always of low importance because it is dropped from the mobile display. A crawler would have to be configurable to ignore parts of the page.

It doesn't make sense to reinvent a wheel that has already been built many times, but I've done a fair amount of research on this and have not found any available code that works, is editable, and seems likely to do a good job. I have collected some ideas and snippets to use in my own algorithm if I do end up writing this from scratch.

I'm sure some of you have tackled the same project. Thoughts? Advice? Code?

sinious · Answer

While I also don't like to reinvent the wheel, I think indexing is still the best strategy for keyword-based searches. As you mentioned, you can easily build your own index on your own CMS, and you know which tables are the highest priority or most likely to contain the desired results. Removing all but the most deserving keywords from both the search query and your content will speed things up plenty for a site below 10,000 pages. I found that most people came to the data I had to search for some seriously narrow reasons, and I was able to speed it up even more by keeping a history of recent or similar searches to feed from (as long as you keep track of where the cached results came from, so that any changes to the data are reflected in that cache).
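As a rough illustration of the indexing idea, here is a sketch of an in-memory inverted index that maps each keyword to the pages containing it. The function names (tokenize, build_index, search_index) are hypothetical, and in a real CMS the index would more likely live in its own MySQL table and be rebuilt whenever a page is saved; that part is an assumption and is not shown:

```php
<?php
/** Strip markup, lower-case, and split text into words. */
function tokenize(string $text): array
{
    return preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower(strip_tags($text)), -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/** Build [word => [pageId => occurrences]], skipping stop words. */
function build_index(array $pages, array $stopWords): array
{
    $index = [];
    foreach ($pages as $pageId => $content) {
        foreach (tokenize($content) as $word) {
            if (in_array($word, $stopWords, true)) {
                continue;
            }
            $index[$word][$pageId] = ($index[$word][$pageId] ?? 0) + 1;
        }
    }
    return $index;
}

/** Look each query term up in the index and sum occurrences per page, best first. */
function search_index(array $index, string $query): array
{
    $scores = [];
    foreach (array_unique(tokenize($query)) as $term) {
        foreach ($index[$term] ?? [] as $pageId => $count) {
            $scores[$pageId] = ($scores[$pageId] ?? 0) + $count;
        }
    }
    arsort($scores);
    return $scores;
}
```

The point of the index is that a search only touches the entries for the query terms instead of re-reading every page, and the same structure gives a natural place to cache the results of frequent queries.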

I realize very few pros want a third-party logo on their site, but offloading the search to something like a Google custom search is always an option. If you're not implementing a grammar-aware, context-based search, then it ultimately comes down to just keywords. The only reasons I'd put it on a large information site are that people (mostly) trust Google and, of course, that Google is going to factor 'context' into your results, along with doing the actual work and keeping the results up to date.

When you consider the sheer amount of logic needed to take context into account, it's pretty frightening how stark the difference is between intelligent search and simple tokenized keyword search. A few reasons: content in different portions of the page should be weighted differently (a title versus headline text versus body content, etc.). You'll get erratic results from tokenizing things that belong together, like "pet friendly hotel in new york", because the search doesn't know that pet and friendly relate to the hotel, nor that new and york belong together as a destination. Just the grouping and contextual intelligence alone is fairly nightmarish to approach. SEO-crazy people tend to litter keywords everywhere, like in the page URL, title, headline and over and over in the content itself, skewing the number of matches per page, perhaps in an unintended way ("pet friendly hotel in new york" may turn up a Yorkshire terrier pet page before a hotel, etc.). Add in even publicly usable assistance systems like "was this document relevant to your search?" and you're down 29 bunny trails in a quick hurry.
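A keyword search can still soften some of these problems. The sketch below weights title, headline and body differently, caps the hits counted per field so repeated keywords cannot dominate, and adds a bonus when a whole phrase such as "new york" appears as consecutive words. The weights, the cap, the bonus and the name weighted_score are all assumptions for illustration, and deciding which words form a phrase is left to the caller:

```php
<?php
// Hypothetical tuning values; adjust them against real queries.
const FIELD_WEIGHTS = ['title' => 5.0, 'headline' => 3.0, 'body' => 1.0];
const MAX_HITS_PER_FIELD = 3;   // cap to dampen keyword stuffing
const PHRASE_BONUS = 4.0;       // multiplier for a whole-phrase match

/**
 * Score a page given as ['title' => ..., 'headline' => ..., 'body' => ...].
 * $terms and $phrases are expected in lower case, phrases single-spaced.
 */
function weighted_score(array $page, array $terms, array $phrases = []): float
{
    $score = 0.0;
    foreach (FIELD_WEIGHTS as $field => $weight) {
        $text = mb_strtolower(strip_tags($page[$field] ?? ''));
        $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
        $counts = array_count_values($words);

        foreach ($terms as $term) {
            $score += $weight * min($counts[$term] ?? 0, MAX_HITS_PER_FIELD);
        }

        // Pad with spaces so "new york" does not match inside "new yorkshire".
        $normalized = ' ' . implode(' ', $words) . ' ';
        foreach ($phrases as $phrase) {
            if ($phrase !== '' && strpos($normalized, ' ' . $phrase . ' ') !== false) {
                $score += $weight * PHRASE_BONUS;
            }
        }
    }
    return $score;
}
```

With the query split into the terms pet, friendly, hotel, new, york and the phrase "new york", a hotel page with the phrase in its title outranks a terrier page that merely repeats "pet", although this is still keyword matching and not real contextual understanding.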

Is keyword good enough for you?
