Friday, May 26, 2006

Search is the unifying solution

This is the phrase Eric Schmidt used to explain Google’s portal-like moves when the new products to improve web search were announced, and I am beginning to use it to explain my own interest in search. I definitely don’t have a decade in the search industry like Danny, and I am not writing an afterword to a book that is the best reference for introducing search to people unfamiliar with the industry, but I have spent some time following search engine technologies. In this post I will share my little journey with web search so far.


I was looking for a topic for my engineering thesis back in 1998, and XML had just become a W3C Recommendation, so it got my attention. While looking for where to apply it on the web I got frustrated with search engines, so I figured I could do something with XML to help improve the search experience. I thought XML was going to change everything, so I decided to build a search engine with a natural language interface that indexes XML documents.


The topic was relatively new, and my allies in this journey were books like the XML Black Book, Natural Language Understanding, and Managing Gigabytes (to understand document indexing), plus subscriptions to Search Engine Watch and the Sociedad Española para el Procesamiento del Lenguaje Natural to stay informed. After two years of dealing with the slow maturing of XML parsers I had a small set of news pages in Spanish, which I manually marked up in NewsML format, and I built a thesaurus in XML format to get the synonyms of query terms.


The idea was simple: a user formulates a question, the question is analyzed, related terms are added to the query, and the expanded query is sent to the XML indexing engine. The results were ranked using term frequency. Every data structure was stored in XML, so performance was slow, but the experiments proved the concept: with XML markup and natural language query analysis, relevant documents that had been ignored became visible to the user.
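The pipeline described above can be sketched roughly in Python. This is only an illustration under my own assumptions, not the thesis code: the thesaurus markup, the stopword list, the sample documents, and the function names (`load_thesaurus`, `expand_query`, `rank`) are all hypothetical, and the examples are in English rather than Spanish.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical thesaurus markup; the real thesis format is not shown in this post.
THESAURUS_XML = """
<thesaurus>
  <entry term="car"><synonym>automobile</synonym><synonym>vehicle</synonym></entry>
  <entry term="price"><synonym>cost</synonym></entry>
</thesaurus>
"""

# A tiny, assumed stopword list standing in for real question analysis.
STOPWORDS = {"what", "is", "the", "of", "a", "an", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def load_thesaurus(xml_text):
    """Parse the thesaurus XML into a dict mapping term -> list of synonyms."""
    root = ET.fromstring(xml_text)
    return {
        entry.get("term"): [syn.text for syn in entry.findall("synonym")]
        for entry in root.findall("entry")
    }


def expand_query(question, thesaurus):
    """Drop stopwords, then add each remaining term's synonyms to the query."""
    expanded = []
    for term in tokenize(question):
        if term in STOPWORDS:
            continue
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def rank(documents, query_terms):
    """Score each document by summed term frequency; return matching ids, best first."""
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


documents = {
    "a": "The automobile cost rose. Automobile makers worry.",
    "b": "Car price news.",
    "c": "Weather report.",
}
thesaurus = load_thesaurus(THESAURUS_XML)
query = expand_query("What is the price of a car", thesaurus)
results = rank(documents, query)
```

In this toy data, document "a" never mentions "car" or "price", so a plain keyword match would skip it. It only shows up in the results because the thesaurus adds "automobile" and "cost" to the query, which is the effect the experiments were meant to show.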


In my thesis I concluded that the future search engine would be a meta-search engine with natural language capabilities that queries various vertical/specialized XML indexes and ranks the results according to the question formulated and some link analysis using XLink. My assumption was that XML was going to be the dominant format on the web, but that hasn’t happened yet. With feed (RSS/Atom) search engines becoming more popular, meta-search engines like gada.be getting more attention, XLink’s new Candidate Recommendation, and a new NewsML 2 Architecture, this scenario can still happen.


With Google delivering good enough results, I was just a search power user until last year. Back in school, while looking for a topic for my MSc thesis, I got interested in concept-based search and joined the Personal Digital Library Project at my school. My current thesis is about concept-based ranking using user-contributed tags/labels and attention metadata in personal repositories.


To stay up to date I listen to The Daily SearchCast and other WebmasterRadio shows, read the excellent coverage of the SES conferences available at the Search Engine Roundtable (read my own coverage of SEW Live Seattle in the next post), and subscribe to various search-related blogs. If you don’t want to wait for Danny’s OPML, this blogroll is a good starting point. Why is search so important now? After all, a lot of the information generated is searchable.
