Masters Internship Report Extraction of information in large graphs; Automatic search for synonyms

I used the Online Plain Text English Dictionary (OPT 2000), which is based on the "Project Gutenberg Etext of Webster’s Unabridged Dictionary", which is in turn based on the 1913 US Webster’s Unabridged Dictionary. It consists of 27 HTML files (one for each letter of the alphabet, and one for several additions). The task was to parse these files in order to build the graph of the dictionary: there is a vertex for every word, and an arc from vertex i to vertex j if word j appears in the definition of word i. I encountered several problems in this operation. Some words defined in the Webster’s dictionary were in fact multi-words (e.g. All Saints’, Surinam toad). I decided not to include them in the graph, since there is no simple way, when two words appear side by side in a definition, to decide whether they should be interpreted as separate single words or as one multi-word.
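The graph construction described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the entries are taken to be already extracted from the HTML files into a mapping from headword to definition text, the function name `build_graph` and the lowercase alphabetic tokenization are hypothetical choices, and the toy entries are invented for the example rather than quoted from the dictionary.

```python
import re


def build_graph(entries):
    """Build the dictionary graph as a mapping: word -> set of words in its definition.

    `entries` maps each headword to its definition text (assumed to be
    already extracted from the HTML files).
    """
    # Vertices: single-word headwords only; multi-word entries are left out.
    vertices = {w.lower() for w in entries if len(w.split()) == 1}
    graph = {w: set() for w in vertices}

    for headword, definition in entries.items():
        i = headword.lower()
        if i not in graph:
            continue  # multi-word headword, skipped
        # Assumed tokenization: maximal runs of lowercase letters.
        for j in re.findall(r"[a-z]+", definition.lower()):
            if j in vertices:
                graph[i].add(j)  # arc i -> j: j appears in the definition of i
    return graph


# Toy entries, invented for illustration only.
toy_entries = {
    "Toad": "An amphibian resembling a frog.",
    "Frog": "An amphibious animal; compare toad.",
    "Amphibian": "An animal living both on land and in water.",
    "Animal": "An organized living being.",
    "Surinam toad": "A species of toad native to Surinam.",
}

graph = build_graph(toy_entries)
for word in sorted(graph):
    print(word, "->", sorted(graph[word]))
```

Words that occur in a definition but have no entry of their own (such as "resembling" in the toy data) produce no arc, because only headwords are vertices. The multi-word entry "Surinam toad" yields no vertex, which mirrors the decision to exclude multi-words.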