How does it work?
MARVIN (Multi-Agent Retrieval Vagabond on Information Networks) searches the Web and
selects only documents that are relevant to a specific and chosen
domain. Document relevance is computed according to a formula
that takes into consideration the number of words from a glossary
of significant terms that MARVIN
finds in the document, as well as their place in the document.
MARVIN was first applied to healthcare.
It stores selected
documents in a database that users can then query, for example through
HON's own medical search engine. MARVIN
is also applied to a variety of scientific domains, such as molecular
biology and 2-D electrophoresis, constantly feeding and updating
the different databases.
MARVIN was designed
as a multi-agent softbot.
Each agent possesses filtering capabilities: it downloads Web pages
and computes the medical "score" of each page, using a glossary
of medical terms to calculate how frequently glossary words appear
in the page.
Categorising documents: medical
The score computed by MARVIN determines whether a Web page is medical or
health-related. It is obtained by adding up the medical terms found in the
document, taking into account their different translations and the weight
of each medical term as defined by the built-in glossary.
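The scoring step described above can be illustrated with a minimal sketch. It is not MARVIN's actual formula: the glossary entries, the weights, the page-level cut-off and the function names are all assumptions made for the example, and the position of a term in the document is ignored.

```python
import re

# Hypothetical glossary: each surface form (in any language) maps to a
# concept identifier and a weight. Entries and weights are invented.
GLOSSARY = {
    "heart": ("HEART", 1.0),
    "cuore": ("HEART", 1.0),        # Italian translation of the same concept
    "diabetes": ("DIABETES", 2.0),
    "diabete": ("DIABETES", 2.0),   # Italian translation of the same concept
    "insulin": ("INSULIN", 2.5),
}

THRESHOLD = 5.0  # assumed page-level cut-off, not a value from the source


def medical_score(text: str) -> float:
    """Sum the weights of all glossary terms occurring in the text."""
    words = re.findall(r"\w+", text.lower())
    return sum(GLOSSARY[w][1] for w in words if w in GLOSSARY)


def is_medical(text: str) -> bool:
    """Classify a page as medical when its score reaches the cut-off."""
    return medical_score(text) >= THRESHOLD
```

Because translations share a weight, a page written in Italian scores the same as its English equivalent under this sketch.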
In the medical domain many thesauri and glossaries already existed,
such as MeSH (Medical Subject Headings) from the National Library of Medicine (NLM) and the glossary in nine European languages
developed at the University of Ghent, Belgium, within the
framework of a European project. For our application, HON built its own
thesaurus by compiling several of these sources. Starting with 12,000 bilingual
(English/French) medical terms, the thesaurus was expanded with
Danish, Dutch, German, Italian, Portuguese and Spanish, resulting in a
thesaurus of 20,000 multilingual medical terms (not counting the 33,000
Studies were undertaken to estimate the relative importance
of a term in a document and in a collection of documents, allowing us
to weight each medical term included in our medical glossary. We analysed
1,000 documents known to be related to medical and health topics and 1,000
documents from other domains. The medical
terms included in each Web page were then evaluated. This study, combined
with other techniques such as the formula of Wilbur and Yang (An analysis
of statistical term strength and its use in the indexing and retrieval
of molecular biology texts, Comp. Bio. Med. 26.3 p. 209-222, 1996), allowed us to define a threshold for each term contained in our medical glossary.
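One simple way to derive a term weight from two such collections is to compare how often the term occurs in each. The sketch below uses a smoothed log-ratio of document frequencies; this is a swapped-in illustration, not the Wilbur and Yang term-strength formula and not necessarily the method HON used. The function names and sample documents are hypothetical.

```python
import math
import re


def document_frequency(term: str, docs: list[str]) -> int:
    """Number of documents in which the term appears at least once."""
    return sum(1 for d in docs if term in set(re.findall(r"\w+", d.lower())))


def term_weight(term: str, medical_docs: list[str], other_docs: list[str]) -> float:
    """Log-ratio of smoothed document frequencies in the two collections.

    Positive values mean the term is more typical of the medical
    collection; values near zero mean it does not discriminate.
    """
    p_med = (document_frequency(term, medical_docs) + 1) / (len(medical_docs) + 2)
    p_other = (document_frequency(term, other_docs) + 1) / (len(other_docs) + 2)
    return math.log(p_med / p_other)


# Tiny invented collections standing in for the 1,000 + 1,000 documents.
medical = ["insulin therapy for diabetes", "insulin dose"]
other = ["football results", "stock market therapy"]
```

With these toy collections, `term_weight("insulin", medical, other)` is positive, while "therapy", which appears equally often in both, gets a weight of zero.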
Using our multilingual medical thesaurus of 50,000 terms, MARVIN downloads
Web pages, calculates a score according to each page's content, and builds
a classical inverted index, in which each word is associated with the list
of documents containing the word. Matching the requested terms is then a
simple and efficient task.
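A classical inverted index of this kind can be sketched in a few lines. The function names, the tokenisation rule and the choice to require all query words (AND semantics) are assumptions for illustration, not details of MARVIN's implementation.

```python
import re
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented sample documents keyed by a hypothetical document id.
docs = {
    "a": "Insulin treats diabetes",
    "b": "Diabetes and diet",
    "c": "Football news",
}
index = build_inverted_index(docs)
```

Answering a query then reduces to a few dictionary lookups and set intersections, which is why matching is cheap once the index exists.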
Fig. 1 MARVIN multi-agent architecture