Lookup NU author(s): Dr Hermann Moisl
The advent of large electronic text corpora has generated a range of technologies for their search and interpretation. Variation in document length can be a problem for these technologies, and several normalization methods for mitigating its effects have been proposed. This paper assesses the effectiveness of such methods specifically in relation to exploratory multivariate analysis. The discussion is in four main parts. The first states the problem. The second describes some normalization methods. The third identifies poor estimation of the population probability of variables as a factor that compromises the effectiveness of normalization for very short documents. The fourth proposes eliminating the data matrix rows that represent documents too short to be reliably normalized, and suggests ways of identifying those documents.
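As a rough illustration of the two operations the abstract mentions, the sketch below normalizes the rows of a small document-by-variable count matrix by document length and drops rows for documents below a length cutoff. The abstract does not say which normalization methods the paper covers or how it identifies documents that are too short, so relative-frequency scaling, the fixed min_length cutoff, the function names, and the toy counts are all assumptions made for illustration, not the paper's method.

```python
def drop_short_documents(matrix, min_length):
    """Keep only rows whose total count (document length) reaches min_length.

    min_length is a hypothetical fixed cutoff chosen for this sketch.
    """
    return [row for row in matrix if sum(row) >= min_length]


def normalize_by_length(matrix):
    """Divide each count by its row total, giving relative frequencies.

    This is one common length normalization, assumed here for illustration.
    """
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Toy data: one row per document, one column per variable (e.g. a word count).
counts = [
    [40, 30, 30],    # 100 tokens
    [2, 0, 1],       # 3 tokens: too short for stable relative frequencies
    [100, 50, 50],   # 200 tokens
]

kept = drop_short_documents(counts, min_length=50)
profiles = normalize_by_length(kept)
```

In this toy case the three-token document is removed before normalization, since its relative frequencies would rest on too few observations to estimate the underlying probabilities well. In practice the cutoff would have to be chosen from the data.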
Author(s): Moisl HL
Publication type: Conference Proceedings (inc. Abstract)
Publication status: Unknown
Conference Name: INFOS2008: 6th International Conference on Informatics and Systems
Year of Conference: 2008
Date deposited: 23/04/2010
Sponsor(s): IEEE