Using a lexical database for semantic analysis of word lists for language learning
Corpus-derived word lists are increasingly used in the production of language learning materials, in an effort to focus vocabulary study on high-frequency items in the belief that these will be most helpful for learners. However, these lists are generally compiled without regard to word sense or part of speech, because many large corpora consist mainly of raw text with no labels carrying semantic information. Nor have the lists generally been subjected to detailed semantic analysis after compilation. Absent this information, researchers often assume that a word can be represented by a single canonical sense and that learners who know that sense know all the word's other senses. How many word senses does this assumption really entail knowing? This talk introduces WordNet, a freely available, machine-readable lexical database of English, and shows how, with a small amount of programming, it can be used to provide a preliminary analysis of the semantic ambiguity present in a commonly used word list for language learners. The talk will be of interest to anyone concerned with vocabulary acquisition or computational approaches to language learning materials development.
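To give a sense of how little programming such an analysis might need, here is a minimal Python sketch that counts senses per word and summarizes how many list items are ambiguous. It is an illustration under stated assumptions, not the method used in the talk. The names `TOY_SENSES`, `toy_lookup`, `sense_counts`, and `summarize` are hypothetical, and the toy inventory is invented for demonstration and does not reflect real WordNet sense counts. The optional `wordnet_lookup` helper assumes the NLTK package and its WordNet corpus are installed.

```python
from statistics import mean

# Hypothetical toy sense inventory, for illustration only.
# These entries are invented and are NOT real WordNet data.
TOY_SENSES = {
    "bank": ["financial institution", "side of a river"],
    "run": ["move quickly on foot", "operate a machine", "manage an organisation"],
    "table": ["piece of furniture", "grid of data"],
}


def toy_lookup(word):
    """Return the toy senses recorded for a word (empty list if absent)."""
    return TOY_SENSES.get(word.lower(), [])


def wordnet_lookup(word):
    """Return WordNet synsets for a word via NLTK.

    Assumes nltk and its 'wordnet' corpus are installed; the import is
    done lazily so the rest of the sketch runs without NLTK.
    """
    from nltk.corpus import wordnet
    return wordnet.synsets(word)


def sense_counts(words, lookup):
    """Map each word to the number of senses the lookup returns for it."""
    return {word: len(lookup(word)) for word in words}


def summarize(counts):
    """Summarize ambiguity: coverage, polysemous items, mean senses per found word."""
    found = [n for n in counts.values() if n > 0]
    return {
        "words": len(counts),
        "not_found": len(counts) - len(found),
        "polysemous": sum(1 for n in found if n > 1),
        "mean_senses": mean(found) if found else 0.0,
    }


# A tiny stand-in for a learner word list (hypothetical).
word_list = ["bank", "run", "table", "the"]
counts = sense_counts(word_list, toy_lookup)
summary = summarize(counts)
print(counts)
print(summary)
```

Passing `wordnet_lookup` in place of `toy_lookup` would run the same counts against WordNet itself. Words with no entry are reported separately rather than counted as having zero senses, since a database may simply not cover them (function words, for example).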