Language identification based on a discriminative text categorization technique

Caraballo Morcillo, Miguel Ángel and D'haro Enríquez, Luis Fernando and Córdoba Herralde, Ricardo de and San Segundo Hernández, Rubén and Pardo Muñoz, José Manuel (2012). Language identification based on a discriminative text categorization technique. In: "IberSPEECH 2012 - VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop", 21/11/2012 - 22/11/2012, Madrid, Spain. pp. 193-203.

Description

Title: Language identification based on a discriminative text categorization technique
Author/s:
  • Caraballo Morcillo, Miguel Ángel
  • D'haro Enríquez, Luis Fernando
  • Córdoba Herralde, Ricardo de
  • San Segundo Hernández, Rubén
  • Pardo Muñoz, José Manuel
Item Type: Presentation at Congress or Conference (Article)
Event Title: IberSPEECH 2012 - VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop
Event Dates: 21/11/2012 - 22/11/2012
Event Location: Madrid, Spain
Title of Book: IberSPEECH 2012 - VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop
Date: 2012
Subjects:
Freetext Keywords: Language Identification, n-gram frequency ranking, discriminative rankings, text categorization, PPRLM
Faculty: E.T.S.I. Telecomunicación (UPM)
Department: Other
Creative Commons Licenses: Attribution - No derivative works - Non commercial

Abstract

In this paper, we describe new results and improvements to a language identification (LID) system based on PPRLM, previously introduced in [1] and [2]. In this case, we use as parallel phone recognizers the ones provided by the Brno University of Technology for the Czech, Hungarian, and Russian languages, and instead of traditional n-gram language models we use a language model built from a ranking of the most frequent and discriminative n-grams. In this language model approach, the distance between the ranking for the input sentence and the ranking for each language is computed from the difference in relative positions of each n-gram. This approach can reliably model longer-span information than traditional language models, yielding more reliable estimates. We also describe the modifications that we have introduced over time to the original ranking technique, e.g., different discriminative formulas to establish the ranking, variations of the template size, the suppression of repeated consecutive phones, and a new clustering technique for the ranking scores. Results show that this technique provides a 12.9% relative improvement over PPRLM. Finally, we also describe results where the traditional PPRLM and our ranking technique are combined.
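
The ranking distance described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it ranks n-grams by plain frequency (in the style of the classic out-of-place measure for text categorization) rather than by the discriminative formulas the paper studies, and the function names, the default `template_size`, and the penalty for unseen n-grams are all assumptions made for illustration.

```python
from collections import Counter


def collapse_repeats(phones):
    """Suppress repeated consecutive phones, e.g. a a b -> a b."""
    out = []
    for p in phones:
        if not out or out[-1] != p:
            out.append(p)
    return out


def ngram_ranking(phones, max_n=3, template_size=400):
    """Map each of the most frequent n-grams (n = 1..max_n) to its rank.

    Plain frequency ranking is an assumption; the paper uses
    discriminative formulas that are not reproduced here.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += 1
    # Descending frequency; ties broken by the n-gram itself for determinism.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {g: rank for rank, g in enumerate(ordered[:template_size])}


def rank_distance(sentence_rank, language_rank, penalty=None):
    """Average difference in rank position between sentence and language.

    N-grams missing from the language template get a fixed penalty
    (assumed here to be the template length).
    """
    if penalty is None:
        penalty = len(language_rank)
    total = 0
    for gram, rank in sentence_rank.items():
        if gram in language_rank:
            total += abs(rank - language_rank[gram])
        else:
            total += penalty
    return total / max(len(sentence_rank), 1)


def identify(phones, templates):
    """Return the language whose template ranking is closest to the input."""
    sentence_rank = ngram_ranking(collapse_repeats(phones))
    return min(templates, key=lambda lang: rank_distance(sentence_rank, templates[lang]))
```

In a PPRLM-style setup, one such template would be built per language and per phone recognizer from the recognizer's output on training data, and the resulting distances would then be combined across recognizers; that combination step is outside this sketch.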

More information

Item ID: 20380
DC Identifier: http://oa.upm.es/20380/
OAI Identifier: oai:oa.upm.es:20380
Deposited by: Memoria Investigacion
Deposited on: 05 Oct 2013 07:22
Last Modified: 21 Apr 2016 23:08