Mímir

Valentin Tablan

2011-05-14

When people use search tools, they are not really looking for words; they are looking for information, which happens to be encoded as words in documents. We in the GATE team are pretty good at letting computers get at [some of] this information. A few years ago we started thinking we could use our information extraction work to provide better search tools. The result of these thoughts (and many years of work) is Mímir, a multi-paradigm index that uses text content, annotations and semantics to let you find what you’re looking for.

This sounds complicated, so let’s look at an example. Everyone knows you can use Google to search for a word like “2011”. This works because Google have built an index containing all the words in the text of all the documents on the web (we’re simplifying things a little). This type of search, where the user enters one or more words and all the documents containing them are found, is called full-text search. It has been the state of the art for many years, and the same techniques are used by most current search engines. While most of them also offer a so-called ‘advanced search’, where you can use, for example, boolean operators, or add clever heuristics such as also searching for synonyms of the query terms, the basic ideas are the same as they were decades ago.
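To make the idea of a word index a bit more concrete, here is a minimal sketch in Python of an inverted index with all-words-must-match search. It is only an illustration of the general technique: the function names (build_index, search), the whitespace tokenisation and the three sample documents are assumptions made up for this example, and real search engines do far more than this.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenisation: split on whitespace only (an assumption
        # made for brevity; real tokenisers are more careful).
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the documents matching the first word, then intersect
    # with the documents matching each remaining word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, invented for this sketch.
docs = {
    "d1": "Mímir was released in 2011",
    "d2": "Full-text search predates 2011 by decades",
    "d3": "Search tools look for words",
}
index = build_index(docs)
hits_2011 = search(index, "2011")
hits_both = search(index, "search 2011")
```

Here hits_2011 contains d1 and d2, while hits_both contains only d2. The index knows nothing about what “2011” means; it only knows which documents contain that string.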

Returning to our example, text analysis tools (such as GATE) are able to tell us that the word “2011” is actually the year part of a date when it occurs in a context like “21/01/2011”. If we were to store such information in an index, you would then be able to search for “{Date}” and match such a context. More advanced analysis tools are also able to parse the date expression and understand it as referring to the 21st of January 2011. An index that also includes such semantics would let you search for all dates that fall, say, between the 15th of December 2010 and the 30th of March 2011, or “{Date normalized >= 20101215 normalized <= 20110330}”.
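The following Python sketch shows roughly what storing and querying such annotations could look like. It is not how Mímir or GATE work internally: the Annotation class, the annotate_dates and search_dates functions, the dd/mm/yyyy regular expression and the sample documents are all assumptions invented for this illustration. The only things taken from the text above are the “Date” annotation type and the idea of a “normalized” feature holding the date as a yyyymmdd number.

```python
import re
from dataclasses import dataclass, field

# Assumption: dates are written dd/mm/yyyy, as in "21/01/2011".
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


@dataclass
class Annotation:
    """A typed span of text in one document, plus arbitrary features."""
    doc_id: str
    type: str
    start: int
    end: int
    features: dict = field(default_factory=dict)


def annotate_dates(doc_id, text):
    """Find date expressions and attach a yyyymmdd 'normalized' feature."""
    annotations = []
    for match in DATE_RE.finditer(text):
        day, month, year = match.groups()
        normalized = int(year + month + day)  # "21/01/2011" becomes 20110121
        annotations.append(
            Annotation(doc_id, "Date", match.start(), match.end(),
                       {"normalized": normalized})
        )
    return annotations


def search_dates(annotations, low, high):
    """Return Date annotations with low <= normalized <= high (inclusive)."""
    return [
        a for a in annotations
        if a.type == "Date" and low <= a.features["normalized"] <= high
    ]


# Hypothetical sample documents, invented for this sketch.
docs = {
    "a": "The meeting is on 21/01/2011.",
    "b": "Released 05/06/2009 and revised in 2011.",
}
all_annotations = []
for doc_id, text in docs.items():
    all_annotations.extend(annotate_dates(doc_id, text))

# Rough equivalent of {Date normalized >= 20101215 normalized <= 20110330}
hits = search_dates(all_annotations, 20101215, 20110330)
```

With these sample documents, hits holds a single annotation, covering “21/01/2011” in document a. The bare “2011” in document b is not matched, because it is not part of a date expression, and “05/06/2009” falls outside the requested range.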

Mímir is an indexing system capable of doing all of the above: it uses the text content, annotations (which can be linguistic or semantic), and formal semantics (such as OWL ontologies) to support complex searches over large text collections.

If this sounds interesting, go ahead and have a play with one of our test indexes!

If you like what you see, have questions, or suggestions, drop me a line using the contact page in the menu of this site.