May 28, 2012
Textual Characteristics of Different-sized Corpora
Recently, textual characteristics, i.e. certain language statistics, have been proposed to compare corpora originating from different genres and domains, to give guidance in language engineering processes, and to estimate the transferability of natural language processing algorithms from one corpus to another. However, it has so far been unclear how these textual characteristics behave for different-sized corpora. We monitor the behavior of 7 textual characteristics across 4 genres (news articles, Wikipedia articles, general web text, and forum posts) and 10 corpus sizes, ranging from 100 to 3,000,000 sentences. We show that certain textual characteristics are almost constant across corpus sizes and thus might be used to reliably compare different-sized corpora, while others are highly dependent on corpus size and thus may only be used to compare similar- or same-sized corpora. Moreover, we find that although textual characteristics vary from genre to genre, their behavior with increasing corpus size is quite similar.
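To make the idea of monitoring a statistic across corpus sizes concrete, here is a minimal Python sketch. It is not the authors' implementation. The two statistics (type-token ratio and mean sentence length), the whitespace tokenization, and all function names are assumptions chosen for illustration, since the abstract does not name the 7 characteristics.

```python
from typing import Callable, Dict, List, Sequence


def type_token_ratio(sentences: Sequence[str]) -> float:
    """Distinct tokens divided by total tokens (assumed: lowercased, whitespace-split)."""
    tokens = [tok.lower() for s in sentences for tok in s.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_sentence_length(sentences: Sequence[str]) -> float:
    """Average number of whitespace-separated tokens per sentence."""
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def characteristics_by_size(
    sentences: Sequence[str], sizes: Sequence[int]
) -> Dict[int, Dict[str, float]]:
    """Compute each statistic on the first n sentences, for every requested size n.

    Sizes larger than the corpus are skipped rather than silently truncated.
    """
    stats: Dict[str, Callable[[Sequence[str]], float]] = {
        "type_token_ratio": type_token_ratio,
        "mean_sentence_length": mean_sentence_length,
    }
    result: Dict[int, Dict[str, float]] = {}
    for n in sizes:
        if n > len(sentences):
            continue
        sample = sentences[:n]
        result[n] = {name: fn(sample) for name, fn in stats.items()}
    return result


# Toy corpus standing in for a real one (hypothetical data).
toy: List[str] = ["the cat sat", "the dog sat", "a cat ran", "the cat sat"]
table = characteristics_by_size(toy, [1, 2, 4, 8])
for size, values in table.items():
    print(size, values)
```

Even on this toy input the two statistics behave differently: the type-token ratio falls as more sentences are added, while the mean sentence length stays fixed. That is the kind of size-dependent versus size-stable contrast the abstract describes.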
[paper]
R. Remus and M. Bank, "Textual Characteristics of Different-sized Corpora," in Proceedings of the Fifth Workshop on Building and Using Comparable Corpora (BUCC'12), 2012.
@InProceedings{Bank2012_3,
author = {Robert Remus and Mathias Bank},
title = {Textual Characteristics of Different-sized Corpora},
booktitle = {Proceedings of the Fifth Workshop on Building and Using Comparable Corpora (BUCC'12)},
year = {2012}
}
