Finding the Minimum Document Length for Reliable Clustering of Multi-Document Natural Language Corpora

Lookup NU author(s): Dr Hermann Moisl

Abstract

Cluster analysis is an important tool for data exploration in corpus linguistics. Data abstracted from a corpus may, however, have characteristics that adversely affect the validity of clustering results, and these must be rectified prior to analysis. This paper deals with one such characteristic, which can arise when the aim is to cluster a document collection by the frequency of textual features and there is substantial variation in the lengths of the documents. The discussion is in three main parts. The first part shows why variation in document length can be a problem for frequency-based clustering. The second describes some data normalizations to deal with the problem and shows that these are ineffective where documents are too short to provide reliable probability estimates for data variables. The third uses statistical sampling theory to develop a method for identifying documents that are too short for normalization to be effective, and proposes that such documents be excluded from the analysis.
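The two ideas summarized in the abstract, normalizing frequencies by document length and using sampling theory to set a minimum length, can be illustrated with a minimal Python sketch. It is not the method developed in the paper: the function names are hypothetical, and the sample-size formula is the standard normal-approximation bound for estimating a proportion, assumed here as a stand-in for whatever criterion the article derives.

```python
import math


def relative_frequencies(counts):
    """Normalize raw feature counts by document length (the sum of the counts)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("document has no counted features")
    return [c / total for c in counts]


def min_length_for_proportion(p, margin, z=1.96):
    """Smallest sample size n with z * sqrt(p * (1 - p) / n) <= margin.

    Assumption: the textbook normal-approximation bound for a proportion,
    not the criterion derived in the paper. p is the expected probability
    of a feature, margin the tolerated absolute error, and z the normal
    quantile (1.96 corresponds to roughly 95% confidence).
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if margin <= 0:
        raise ValueError("margin must be positive")
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


def long_enough(documents, n_min):
    """Keep only the documents whose total feature count reaches n_min."""
    return [doc for doc in documents if sum(doc) >= n_min]


# Hypothetical count vectors: same feature profile, very different lengths.
short_doc = [2, 6, 2]
long_doc = [20, 60, 20]
threshold = min_length_for_proportion(p=0.5, margin=0.05)
kept = long_enough([short_doc, long_doc], n_min=50)
```

After normalization the two hypothetical documents have identical profiles, but the estimate from the ten-token document rests on far too little evidence: under the assumed bound, a feature with probability 0.5 needs several hundred tokens to be estimated within 0.05. A length filter of this kind drops such documents before clustering.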


Publication metadata

Author(s): Moisl H

Publication type: Article

Publication status: Published

Journal: Journal of Quantitative Linguistics

Year: 2011

Volume: 18

Issue: 1

Pages: 23-52

Print publication date: 24/02/2011

Date deposited: 27/09/2010

ISSN (print): 0929-6174

ISSN (electronic): 1744-5035

Publisher: Routledge

URL: http://dx.doi.org/10.1080/09296174.2011.533588

DOI: 10.1080/09296174.2011.533588
