May 28, 2012

Textual Characteristics for Language Engineering

Language statistics have been widely used to characterize and understand languages, and to support tasks such as language learning and dictionary creation. In parallel, the methods and algorithms used for Text Mining, Information Retrieval, and Information Extraction have grown rapidly, and their results are normally reported on standardized (newspaper) corpora. However, there has been no attempt to connect these areas by using language statistics to characterize the corpora on which evaluations were made, and thereby to help other researchers pick the right algorithms for their corpus. The authors of this work strongly believe that no results for text analysis algorithms should be published without a quantitative description of the evaluation corpus; only then can the real value of the methods be determined, as well as their portability to other sublanguages. This work tries to lay the groundwork by gathering and defining a set of language characteristics we consider valuable for building language processing systems. We explicitly call upon the scientific community to provide feedback and help establish a good practice of corpus-aware evaluations.

[paper]

  • [2012,inproceedings] bibtex
    M. Bank, R. Remus, and M. Schierle, "Textual Characteristics for Language Engineering," in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul, Turkey, May 23-25, 2012.
    @InProceedings{BANK2012_1,
      author = {Mathias Bank and Robert Remus and Martin Schierle},
      title = {Textual Characteristics for Language Engineering},
      booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
      year = {2012},
      month = {may},
      date = {23-25},
      address = {Istanbul, Turkey},
      editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Uğur Doğan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
      publisher = {European Language Resources Association (ELRA)},
      isbn = {978-2-9517408-7-7},
      language = {english}
    }
