As you might be aware, I’m a huge proponent of the democratization of information. It’s one of the reasons I’m such a fan of RSS and why I’m afraid of Google. With their virtual monopoly on search, they’ve locked people into the information that they choose to present. While you may believe Google is benevolent, they’re a for-profit corporation just like any other company. They hide information about Nazis if you search in their French pages, and other such stories exist. My issue isn’t with Google specifically, but with anybody who controls the “gateway” to the Internet. The same fears apply to Yahoo Search and MSN Search.
I want an open source, democratized search engine. I want to be able to hit a page or use a desktop client and search a database that isn’t controlled by anybody, using code that is developed by many. It looks like there’s a group called Nutch that has developed a core engine in Java, which may solve the latter half of my problem. Unfortunately, it’s not easy to get a multi-terabyte setup that could host web content.
Thus, I think there should be a partition-based system in which the database is split into subpartitions spread across a peer-to-peer network. Some subset of the peers hosts subpartition A, another subset of peers hosts subpartition B, and so on. A peer in a subpartition would be responsible for crawling and refreshing its own subset of the database. The only problem with having the data partitioned across multiple peers is the execution of an actual search. Theoretically, the search processor would have to execute the search on a peer in each of the subpartitions, so every query costs at least one request per subpartition, which might not be optimal.
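To make the idea a bit more concrete, here is a rough Python sketch of how it might work. Everything in it is my own assumption for illustration and has nothing to do with Nutch's actual code: the names `partition_for`, `Peer`, `build_network`, `publish`, and `search` are made up, pages are assigned to subpartitions by hashing their URL, and "ranking" is just a count of how often the term appears on the page.

```python
import hashlib
import heapq
import random
from collections import Counter


def partition_for(url, num_partitions):
    """Map a URL to a subpartition by hashing it (an assumed scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class Peer:
    """A hypothetical peer holding a replica of one subpartition."""

    def __init__(self, partition_id):
        self.partition_id = partition_id
        self.pages = {}  # url -> Counter of terms on that page

    def add_page(self, url, text):
        self.pages[url] = Counter(text.lower().split())

    def search(self, term):
        # Toy scoring: number of times the term appears on the page.
        term = term.lower()
        return [(counts[term], url)
                for url, counts in self.pages.items()
                if counts[term] > 0]


def build_network(num_partitions, replicas):
    """Create `replicas` peers for each subpartition."""
    return {p: [Peer(p) for _ in range(replicas)]
            for p in range(num_partitions)}


def publish(network, url, text):
    """Store a page on every peer of the subpartition that owns the URL."""
    for peer in network[partition_for(url, len(network))]:
        peer.add_page(url, text)


def search(network, term, limit=10):
    """Ask one peer per subpartition, then merge the partial results."""
    hits = []
    for peers in network.values():
        hits.extend(random.choice(peers).search(term))
    return heapq.nlargest(limit, hits)
```

Note that `search` has to touch every subpartition before it can answer, which is exactly the cost I'm worried about: the number of requests per query grows with the number of subpartitions, no matter how many replicas each one has.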
The basic idea, as you can see, is to distribute subsections of the data across peers, with each peer group replicating its subsection amongst itself and being responsible for doing its own updates. Thoughts? Any volunteers to help make it?