Lookup NU author(s): Dr Hermann Moisl
Cluster analysis is an important tool for data exploration in corpus linguistics. Data abstracted from a corpus may, however, have characteristics that can adversely affect the validity of clustering results, and these must be rectified prior to analysis. This paper deals with one such characteristic, which can arise when the aim is to cluster a document collection by the frequency of textual features and the documents vary substantially in length. The discussion is in three main parts. The first part shows why variation in document length can be a problem for frequency-based clustering. The second describes some data normalizations that address the problem and shows that these are ineffective where documents are too short to provide reliable probability estimates for data variables. The third uses statistical sampling theory to develop a method for identifying documents that are too short for normalization to be effective, and proposes that such documents be excluded from the analysis.
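The general idea in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the relative-frequency normalization and the textbook sample-size formula for a proportion are stand-ins chosen for illustration, and the function names, feature counts, and parameter values below are all hypothetical.

```python
import math


def relative_frequencies(counts):
    """Divide each raw feature count by the document total (assumed normalization)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("document has no counted features")
    return [c / total for c in counts]


def min_sample_size(p, margin, z=1.96):
    """Textbook sample size for estimating a proportion p to within +/- margin.

    z = 1.96 corresponds to roughly 95% confidence. This is an assumed
    stand-in for the paper's sampling-theory criterion, not a copy of it.
    """
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


def too_short(counts, p, margin):
    """Flag a document whose total count is below the required sample size."""
    return sum(counts) < min_sample_size(p, margin)


# Hypothetical counts of three textual features in two documents.
short_doc = [2, 1, 1]
long_doc = [200, 100, 100]

# After normalization the two profiles are identical, so length no longer
# separates them, but the short document's estimates rest on only 4 tokens.
short_rel = relative_frequencies(short_doc)
long_rel = relative_frequencies(long_doc)

required = min_sample_size(p=0.25, margin=0.05)
short_flag = too_short(short_doc, p=0.25, margin=0.05)
long_flag = too_short(long_doc, p=0.25, margin=0.05)
```

In this sketch the short document would be flagged for exclusion and the long one retained, which mirrors the abstract's proposal only in outline; the actual threshold derivation is given in the paper itself.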
Author(s): Moisl H
Publication type: Article
Publication status: Published
Journal: Journal of Quantitative Linguistics
Year: 2011
Volume: 18
Issue: 1
Pages: 23-52
Print publication date: 24/02/2011
Date deposited: 27/09/2010
ISSN (print): 0929-6174
ISSN (electronic): 1744-5035
Publisher: Routledge
URL: http://dx.doi.org/10.1080/09296174.2011.533588
DOI: 10.1080/09296174.2011.533588