Normalization for Variation in Document Length in Exploratory Multivariate Analysis of Text Corpora

NU author(s): Dr Hermann Moisl

Abstract

The advent of large electronic text corpora has generated a range of technologies for their search and interpretation. Variation in document length can be a problem for these technologies, and several normalization methods for mitigating its effects have been proposed. This paper assesses the effectiveness of such methods in specific relation to exploratory multivariate analysis. The discussion is in four main parts. The first part states the problem, the second describes some normalization methods, the third identifies poor estimation of the population probability of variables as a factor that compromises the effectiveness of the normalization methods for very short documents, and the fourth proposes elimination of data matrix rows representing documents which are too short to be reliably normalized and suggests ways of identifying the relevant documents.
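As a rough illustration of the two ideas the abstract combines (dropping rows for documents that are too short, then normalizing the remaining rows for length), the sketch below uses scaling to mean document length, which is one common normalization scheme and not necessarily the one the paper recommends. The function name `normalize_by_length` and the `min_length` threshold are hypothetical; the abstract does not say how the threshold should be chosen.

```python
def normalize_by_length(matrix, min_length):
    """Drop short-document rows, then rescale the rest to the mean length.

    matrix: list of rows, one per document, each a list of raw frequency
            counts for the variables (for example, word counts).
    min_length: hypothetical cutoff; rows whose total count is below it
                are treated as too short to normalize reliably.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")

    # Eliminate rows that represent documents shorter than the cutoff.
    kept = [row for row in matrix if sum(row) >= min_length]
    if not kept:
        return []

    # Mean length of the retained documents.
    mean_len = sum(sum(row) for row in kept) / len(kept)

    # Scale each row so that its total equals the mean length.
    return [[count * mean_len / sum(row) for count in row] for row in kept]


# Three documents over two variables; the second has only one token.
raw = [[2, 2], [1, 0], [6, 2]]
normalized = normalize_by_length(raw, min_length=4)
```

After scaling, every retained row sums to the same total, so differences between rows reflect relative frequencies and not document length.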


Publication metadata

Author(s): Moisl HL

Publication type: Conference Proceedings (inc. Abstract)

Publication status: Unknown

Conference Name: INFOS2008: 6th International Conference on Informatics and Systems

Year of Conference: 2008

Date deposited: 23/04/2010

Sponsor(s): IEEE
