Language identification based on a discriminative text categorization technique

Caraballo Morcillo, Miguel Ángel, D'Haro Enríquez, Luis Fernando, Córdoba Herralde, Ricardo de, San Segundo Hernández, Rubén and Pardo Muñoz, José Manuel (2012). Language identification based on a discriminative text categorization technique. In: "IberSPEECH 2012 - VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop", 21/11/2012 - 22/11/2012, Madrid, Spain. pp. 193-203.

Description

Title:	Language identification based on a discriminative text categorization technique
Author(s):	Caraballo Morcillo, Miguel Ángel; D'Haro Enríquez, Luis Fernando (https://orcid.org/0000-0002-3411-7384); Córdoba Herralde, Ricardo de (https://orcid.org/0000-0002-7136-9636); San Segundo Hernández, Rubén (https://orcid.org/0000-0001-9659-5464); Pardo Muñoz, José Manuel (https://orcid.org/0000-0002-1009-590X)
Document Type:	Conference or Workshop Paper (Article)
Event Title:	IberSPEECH 2012 - VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop
Event Dates:	21/11/2012 - 22/11/2012
Event Location:	Madrid, Spain
Book Title:	IberSPEECH 2012 - VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop
Date:	2012
Subjects:	Telecommunications
SDGs:	04. Quality education; 09. Industry, innovation and infrastructure
Informal Keywords:	Language Identification, n-gram frequency ranking, discriminative rankings, text categorization, PPRLM
School:	E.T.S.I. Telecomunicación (UPM)
Department:	Other
Creative Commons Licenses:	Attribution - No Derivative Works - Non-Commercial

Abstract

In this paper, we describe new results and improvements to a language identification (LID) system based on PPRLM previously introduced in [1] and [2]. In this case, we use as parallel phone recognizers the ones provided by the Brno University of Technology for the Czech, Hungarian, and Russian languages, and instead of traditional n-gram language models we use a language model built from a ranking of the most frequent and discriminative n-grams. In this language model approach, the distance between the ranking for the input sentence and the ranking for each language is computed from the difference in relative positions of each n-gram. This approach can reliably model longer-span information than traditional language models, yielding more reliable estimates. We also describe the modifications that we have introduced over time to the original ranking technique, e.g., different discriminative formulas to establish the ranking, variations of the template size, the suppression of repeated consecutive phones, and a new clustering technique for the ranking scores. Results show that this technique provides a 12.9% relative improvement over PPRLM. Finally, we also describe results where the traditional PPRLM and our ranking technique are combined.
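
The ranking distance described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it ranks n-grams by plain frequency only, in the style of the classic out-of-place measure from n-gram text categorization, and leaves out the discriminative formulas and the score clustering mentioned above. All function names, the `max_n` and `template_size` defaults, the fixed penalty for unseen n-grams, and the toy phone strings are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def collapse_repeats(phones):
    """Suppress repeated consecutive phones, e.g. a a b -> a b."""
    return [p for p, _ in groupby(phones)]


def ngram_ranking(phones, max_n=3, template_size=100):
    """Map each n-gram (n = 1..max_n) to its rank by decreasing frequency."""
    counts = Counter(
        tuple(phones[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(phones) - n + 1)
    )
    # Sort by frequency, then by the n-gram itself so ties are deterministic.
    ordered = sorted(counts, key=lambda g: (-counts[g], g))
    return {g: rank for rank, g in enumerate(ordered[:template_size])}


def out_of_place_distance(sample_rank, language_rank, penalty=None):
    """Sum of rank differences between a sample and a language template.

    N-grams missing from the language template receive a fixed penalty
    (assumed here to be the template length).
    """
    if penalty is None:
        penalty = len(language_rank)
    return sum(
        abs(rank - language_rank[g]) if g in language_rank else penalty
        for g, rank in sample_rank.items()
    )


def identify(phones, templates, max_n=3, template_size=100):
    """Return the language whose template is closest to the input ranking."""
    sample = ngram_ranking(collapse_repeats(phones), max_n, template_size)
    return min(
        templates,
        key=lambda lang: out_of_place_distance(sample, templates[lang]),
    )


# Toy templates built from made-up phone strings (hypothetical data).
templates = {
    "X": ngram_ranking("a b a b a b c".split()),
    "Y": ngram_ranking("d e d e d f e".split()),
}
best = identify("a b b a b".split(), templates)
```

In a PPRLM-style setup, one such set of templates would be built per phone recognizer, and the per-recognizer distances would then be combined; how the paper combines them is not specified in this record.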

More information

Record ID:	20380
DC Identifier:	https://oa.upm.es/20380/
OAI Identifier:	oai:oa.upm.es:20380
Deposited by:	Memoria Investigacion
Deposited on:	05 Oct 2013 07:22
Last Modified:	22 Mar 2023 16:30