Search algorithm

Report · Oct 01, 2015

I am interested in hearing about the self-coded PHP/MySQL site search processes others use.

MySQL LIKE simply doesn't cut it.
MySQL MATCH AGAINST IN BOOLEAN MODE is an improvement over LIKE, but still disappoints.
Lucene-based crawlers/search engines require Java and special applications on the server, and seem most appropriate for huge document collections.
PHP Sphider combines a PHP-based crawler/indexer with good search features, but the code is ancient and buggy. The code is easy to understand so I have considered overhauling it, but that would be time consuming and success is not assured.

Although my whole CMS database is large, each client site has no more than 500 pages to search, so iterating through all the page content and using PHP to process results seems like a viable option. I can then remove common words like "the, to, a, with" from the search string, create an array of the search terms and iterate through the database content results using functions like similar_text (), rank result relevance, etc.

Working directly on the database content instead of crawling websites has the advantage of making it easy to target the most relevant content. For instance, on a three column layout, in my CMS the left column content is always of low importance because it is dropped from the mobile display. A crawler would have to be configurable to ignore parts of the page.

It doesn't make sense to reinvent a wheel that has already been built many times, but I've done a fair amount of research on this and have not found any available code that wasn't broken, was editable, and seemed like it would do a good job, though I have collected some ideas and snippets to use in my own algorithm, if I do end up writing this from scratch.

I'm sure some of you have tackled the same project. Thoughts? Advice? Code?

Report · Oct 01, 2015

While I also don't like to reinvent the wheel, I think indexing is still the best strategy for keyword based searches. As you mentioned, you can easily build your own index on your own CMS and you know which tables are the highest priority or most likely to contain the desired results. Removing all but the most deserving of keywords from both the search query and your content will speed things up plenty for a site below 10,000 pages. I found that most people came to the data I had to search for some seriously narrow reasons and was able to speed it up even more keeping a recent or similar search history to feed from (as long as you remember how to get to the data so any changes are reflected in that cache).

I realize very few pros want a logo, but offloading the search to something like a Google custom search is always an option. If you're not implementing a grammar-esque context based search then it ultimately falls down to just keywords. The only way I'd put it on a large information site is people (mostly) trust Google and of course Google is going to take 'context' into factoring your results, along with the job of actually performing the work and keeping their results up to date.

When you consider the sheer amount of logic to take into context it's pretty frightening on how black and white it is using intelligent search and simple tokenized keyword search. A few reasons are, content in different portions of the page should be considered different (a title versus headline text versus body content, etc). You'll get erratic results from tokenizing things that belong together like "pet friendly hotel in new york" because it doesn't know pet and friendly relate to the hotel, nor that new and york belong together as a destination. Just the grouping and contextual intelligence along is fairly nightmareish to approach. SEO crazy people tend to litter content everywhere, like in the page URL, title, headline and over and over in the content itself weighing down the number of matches per page perhaps in an unintended way ("pet friendly hotel in new york" may turn up a yorkshire terrier pet page before a hotel, etc). Add in even publicly usable assistance systems like "was this document relevant to your search?", etc etc and you're down 29 bunny trails in a quick hurry .

Is keyword good enough for you?

Report · Oct 02, 2015

offloading the search to something like a Google custom search is always an option.

There are a couple of reasons Google custom search isn't an option for my system.

My search feature needs to work without the need to setup and pay for an external account, the system being highly automated.
Search results are from courses, pages, blogs and merchandise tables and I need to format course results differently from page results, etc.

Is keyword good enough for you?

Yes. Keyword is fine. There will be some odd search returns, as you illustrated in your pet/NYC example, but the client websites all have narrow context to begin with, so search returns would never be very far off base.

I have discovered a very interesting article that includes a PHP code example that is going to be central to my solution:

Simple Search: The Vector Space Model

Report · Oct 06, 2015

Thanks for sharing that, it was a good clean and clear example of varying ways to weight information. I've always used an index with inverse, the more rare the more significant. Although typos can riddle this category.

Is this the direction you went and if so how is it working for you? It's good to hear that you have data coming from categorized areas, that's a huge leap forward itself.

Report · Oct 06, 2015

Is this the direction you went and if so how is it working for you?

This project was taking too much time (time I wasn't making money from), so I set it aside and simply made some improvements to the MySQL match against method I was using originally.

Instead of querying my tables separately, I got everything in one query, which allowed MySQL to rank all the results as a group instead of just within each table. It is also now, as of version 5.6, possible to run full text searches on innodb, whereas previously it was only possible on myisam.

Ultimately though, I think I will put my effort into the Lucene/SOLR option. when I do I will report back here.

Report · Oct 15, 2015

You may want to checkout elasticsearch. It uses Lucene engine and is very easy to configure and scale.

Adobe Community

Search algorithm