Broad-coverage hierarchical word sense disambiguation
In naturally occurring language, hearers and readers face large numbers of “ambiguous” words, i.e., words with multiple senses, and “unknown” words, i.e., words they are encountering for the first time and which therefore cannot yet be in their lexicon. Ambiguous and unknown words seem to cause little difficulty for humans, who infer their syntactic and semantic properties on the fly to resolve ambiguities and incorporate unknown words into their lexicons. They do, however, pose problems for dictionary-based approaches in natural language processing applications. To use the information contained in a dictionary, it is necessary to associate each word in the text being processed with one of the senses or concepts defined in the dictionary. If the word is ambiguous, the intended sense must be identified among the possible senses of the word; if it is unknown, the word must be assigned to one among all the senses defined by the dictionary. The acquisition of unknown words can thus be seen as a disambiguation task in which the possible senses are all the senses listed in the dictionary. In this thesis we formulate a single unified approach to learning unknown words and performing word sense disambiguation. We focus on nouns, but our method can be generalized to verbs and other syntactic categories. We propose a broad-coverage method that can be applied to any kind of text, and we frame the problem as a pattern classification task: each ambiguous or unknown word is classified as belonging to one of the existing concepts on the basis of morphological, syntactic and semantic properties of the contexts in which it appears. Our system takes as input an existing dictionary, which defines a hierarchy of concepts, and a corpus of textual data, and disambiguates all nouns in the corpus. We demonstrate this by disambiguating all nouns in a 40-million-word collection of newspaper articles. 
We present empirical results, including results from experiments with novel multi-level classification techniques, which exploit generalizations that hold at different levels of the concept hierarchy.
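
The unified framing described above, in which an ambiguous word is classified among its own listed senses and an unknown word among all senses in the dictionary, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy dictionary, the feature strings, and the names `candidate_senses`, `train` and `classify` are hypothetical, the sense inventory is flat rather than hierarchical, and the simple feature-count scorer stands in for whatever classifier the thesis actually uses.

```python
from collections import Counter, defaultdict

# Hypothetical toy dictionary: noun -> senses listed for it (flat, not a hierarchy).
DICTIONARY = {
    "bank": ["institution", "land"],
    "river": ["water"],
}
# Every sense defined anywhere in the dictionary.
ALL_SENSES = sorted({s for senses in DICTIONARY.values() for s in senses})


def candidate_senses(noun):
    """Ambiguous word: its listed senses. Unknown word: all senses in the dictionary."""
    return DICTIONARY.get(noun, ALL_SENSES)


def train(examples):
    """Count how often each context feature co-occurs with each sense."""
    counts = defaultdict(Counter)
    for sense, features in examples:
        counts[sense].update(features)
    return counts


def classify(noun, features, counts):
    """Return the candidate sense whose training contexts share the most features."""
    def score(sense):
        return sum(counts[sense][f] for f in features)
    return max(candidate_senses(noun), key=score)


# Hypothetical labeled contexts: (sense, context features).
TRAINING = [
    ("institution", ["money", "loan"]),
    ("land", ["river", "shore"]),
    ("water", ["flow", "shore"]),
]
counts = train(TRAINING)

print(classify("bank", ["loan", "money"], counts))   # known, ambiguous noun
print(classify("creek", ["flow", "shore"], counts))  # unknown noun
```

The only difference between the two cases is the candidate set returned by `candidate_senses`; the scoring step is shared, which is the sense in which the two tasks are treated as one classification problem in this sketch.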
0984: Computer science