n-gram Frequency Ranking with additional sources of information in a multiple-Gaussian classifier for Language Identification

Córdoba Herralde, Ricardo de and D'haro Enríquez, Luis Fernando and Lucas Cuesta, Juan Manuel and Zugasti Raposo, Javier (2008). n-gram Frequency Ranking with additional sources of information in a multiple-Gaussian classifier for Language Identification. In: "V Jornadas en Tecnología del Habla", 12/11/2008-14/11/2008, Bilbao. ISBN 978-84-9860-169-5. pp. 49-52.

Description

Title: n-gram Frequency Ranking with additional sources of information in a multiple-Gaussian classifier for Language Identification
Author/s:
  • Córdoba Herralde, Ricardo de
  • D'haro Enríquez, Luis Fernando
  • Lucas Cuesta, Juan Manuel
  • Zugasti Raposo, Javier
Item Type: Presentation at Congress or Conference (Article)
Event Title: V Jornadas en Tecnología del Habla
Event Dates: 12/11/2008-14/11/2008
Event Location: Bilbao
Title of Book: Libro de Actas
Date: 2008
ISBN: 978-84-9860-169-5
Subjects:
Freetext Keywords: Language Identification, n-gram frequency ranking, score normalization, feature selection, PPRLM
Faculty: E.T.S.I. Telecomunicación (UPM)
Department: Ingeniería Electrónica
Creative Commons Licenses: Attribution - No Derivative Works - Non-Commercial

Abstract

We present new results for our n-gram frequency ranking approach to language identification. We use a parallel phone recognizer, as in PPRLM, but replace the language model with a ranking of the most frequent n-grams. We then compute the distance between the ranking of the input sentence and the ranking of each language, based on the difference in relative position of each n-gram. The aim of this ranking is to model reliably a longer span than PPRLM does. The approach outperforms PPRLM (15% relative improvement) thanks to the inclusion of 4-grams and 5-grams in the classifier. Combining this technique with other sources of information (feature vectors in our classifier) also improves on PPRLM; we give a detailed analysis of the relevance of these sources and a simple feature selection technique to cope with long feature vectors. The test database has been significantly enlarged using cross-fold validation, so comparisons are now more reliable.
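
The ranking-and-distance idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact distance formula, the ranking size, or the treatment of n-grams missing from a language ranking, so the absolute rank difference, the `top_k` cutoff, the `missing_penalty` default, and all function names below are assumptions made for illustration. The input is assumed to be a phone sequence already produced by a phone recognizer.

```python
from collections import Counter


def ngram_ranking(phones, max_n=5, top_k=100):
    """Map each of the top_k most frequent n-grams (n = 1..max_n) to its rank."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += 1
    # Most frequent first; ties are broken by the n-gram itself for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ordered[:top_k])}


def ranking_distance(sentence_rank, language_rank, missing_penalty=None):
    """Average rank displacement of the sentence n-grams in a language ranking.

    Assumption: an n-gram absent from the language ranking costs
    missing_penalty, which defaults to the size of that ranking.
    """
    if not sentence_rank:
        return 0.0
    if missing_penalty is None:
        missing_penalty = len(language_rank)
    total = 0
    for gram, pos in sentence_rank.items():
        if gram in language_rank:
            total += abs(pos - language_rank[gram])
        else:
            total += missing_penalty
    return total / len(sentence_rank)


def identify(phones, language_ranks, max_n=5, top_k=100):
    """Return the language whose ranking is closest to the sentence ranking."""
    sentence_rank = ngram_ranking(phones, max_n, top_k)
    return min(
        language_ranks,
        key=lambda lang: ranking_distance(sentence_rank, language_ranks[lang]),
    )


# Toy phone sequences standing in for recognizer output (hypothetical data).
training = {
    "A": "a b a b a b c".split(),
    "B": "x y x y z x y".split(),
}
language_ranks = {lang: ngram_ranking(seq, max_n=3) for lang, seq in training.items()}
guess = identify("a b a b".split(), language_ranks, max_n=3)
```

In this sketch a lower distance means a closer match, so `identify` picks the minimum. In the setting the abstract describes, the distances would instead serve as inputs (feature vectors) to a classifier alongside other sources of information; that stage is not sketched here because the abstract gives no details of it.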

More information

Item ID: 3141
DC Identifier: http://oa.upm.es/3141/
OAI Identifier: oai:oa.upm.es:3141
Official URL: http://jth2008.ehu.es/
Deposited by: Memoria Investigacion
Deposited on: 27 May 2010 08:22
Last Modified: 20 Apr 2016 12:41