Building Your Own Blog Search Engine in an Afternoon

A couple of weekends ago, I thought it’d be fun to write a blog search engine, so I sat down, gave it a go and waited to see what happened. It turns out that writing a basic search system for blogs is pretty easy.

RedYawning Search, the system I ended up writing, has three components: a library that gets updates from the blogosphere on what’s changed, a service that pulls down RSS feeds and writes them to disk, and a web site that lets you execute queries. Writing RY Search took about 6 hours in total.

The first problem is getting some content to crawl. To get myself bootstrapped, I relied on sites like weblogs.com and blo.gs, which allow blogs to ping them and let applications consume those updates via an API. Weblogs.com is a pretty wacky system built around a gigantic XML file, but blo.gs provides a stream of XML fragments that you can hook into to get the updates it’s receiving in real time. To consume those updates, I wrote a little C# library called CloudStreamer that hooks into the stream from blo.gs and fires an event to all of its subscribers for every weblog that comes across its path. The only trick I ran into with CloudStreamer was that the blo.gs feed would sometimes drop out and stop transmitting, so CloudStreamer can recycle itself if it stops getting data or starts getting bad data.
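To make the subscribe-and-recycle idea concrete, here is a minimal Python sketch. It is not CloudStreamer itself (that was written in C#), and every name in it is made up for illustration: `StreamWatcher`, the `connect` callable, and the two limits are all hypothetical. It also assumes each fragment is a small XML element whose attributes describe the weblog, which may not match what blo.gs actually sent.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, List


class StreamWatcher:
    """Hypothetical stand-in for CloudStreamer: fans updates out to subscribers
    and reopens the stream when it ends or keeps delivering bad data."""

    def __init__(self, connect: Callable[[], Iterable[str]],
                 max_bad_fragments: int = 3, max_connections: int = 5):
        self.connect = connect                    # opens a fresh stream of XML fragments
        self.max_bad_fragments = max_bad_fragments
        self.max_connections = max_connections    # cap so the sketch always terminates
        self.subscribers: List[Callable[[dict], None]] = []

    def subscribe(self, handler: Callable[[dict], None]) -> None:
        self.subscribers.append(handler)

    def run(self) -> int:
        """Consume the stream; return how many times it was opened."""
        connections = 0
        while connections < self.max_connections:
            connections += 1
            bad_in_a_row = 0
            for fragment in self.connect():
                try:
                    element = ET.fromstring(fragment)
                except ET.ParseError:
                    bad_in_a_row += 1
                    if bad_in_a_row >= self.max_bad_fragments:
                        break                     # too much bad data: recycle the connection
                    continue
                bad_in_a_row = 0
                update = dict(element.attrib)     # assumed shape: attributes carry the update
                for handler in self.subscribers:
                    handler(update)
            # Reaching here means the stream ended or was abandoned, so loop and reconnect.
        return connections
```

A stream that goes silent without closing would also need a read timeout on the underlying connection, which this sketch leaves out.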

The service is a multi-threaded app backed by a database. It has two heads: the first receives and queues blog updates via CloudStreamer, and the second dequeues updates from the database and reads, parses and saves their RSS (or Atom) feeds. In a typical search system, this component is the crawler. The database stores the feed locations, whether they need updating, and the metadata for each entry (e.g. the permalink, the title and a unique ID). On disk, each entry is written as a regular text file named after its ID, in a directory chosen by taking the ID modulo some partitioning value (to keep directories from getting too full or hard to manage).
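The on-disk layout is easy to sketch. The Python below is only an illustration of the modulo idea: the partition count of 256, the `.txt` extension and the function names `entry_path` and `save_entry` are my assumptions here, not values taken from RedYawning Search.

```python
from pathlib import Path

PARTITIONS = 256  # assumed value; the actual partitioning value is not stated


def entry_path(root: Path, entry_id: int, partitions: int = PARTITIONS) -> Path:
    """Return root/<entry_id mod partitions>/<entry_id>.txt (extension is an assumption)."""
    return root / str(entry_id % partitions) / f"{entry_id}.txt"


def save_entry(root: Path, entry_id: int, text: str) -> Path:
    """Write one entry as a plain text file inside its partition directory."""
    path = entry_path(root, entry_id)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the partition on first use
    path.write_text(text, encoding="utf-8")
    return path
```

With sequential IDs, each directory ends up holding roughly the total number of entries divided by the partition count, so no single directory grows out of hand.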

To actually index the individual RSS entries, the major saving grace is the Windows Search service. I can’t believe this thing: it’s been around since Windows 2000 and it rocks. I can’t believe nobody has used it. My use of it is super straightforward: I tell it to index the files that are stored to disk. That’s it. Whenever you execute a search against its index, it returns the name of each file it found the words in, which, conveniently enough, is the ID of the entry. The last piece, the web site, simply accesses the Search service through its ODBC interface, gets a list of filenames, uses those as IDs to look up titles in the database and, poof, a blog search engine.
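The last hop, from filenames back to titles, might look roughly like this in Python. It is a sketch under assumptions: SQLite stands in for whatever database the service used, the `entries` table with `id`, `title` and `permalink` columns is hypothetical, and the list of filenames is taken as given rather than fetched from the Search service.

```python
import sqlite3
from pathlib import PureWindowsPath
from typing import Iterable, List, Tuple


def titles_for_hits(conn: sqlite3.Connection,
                    filenames: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Turn file names returned by the index into (id, title, permalink) rows,
    keeping the order of the hits and skipping IDs the database does not know."""
    # The file name without its extension is the entry ID.
    ids = [int(PureWindowsPath(name).stem) for name in filenames]
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title, permalink FROM entries WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    return [by_id[i] for i in ids if i in by_id]
```

Keeping the hits in the order the index returned them means any ranking the index applied survives the database lookup.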

Download RedYawningSearch v1.0.