Linear Bayes policy for learning in contextual-bandits

Vargas Perez, Ana Maria and Martín H., José Antonio (2013). Linear Bayes policy for learning in contextual-bandits. "Expert Systems with Applications", v. 40 (n. 18); pp. 7400-7406. ISSN 0957-4174.


Title: Linear Bayes policy for learning in contextual-bandits
  • Vargas Perez, Ana Maria
  • Martín H., José Antonio
Document type: Article
Journal title: Expert Systems with Applications
Date: December 2013
Volume: 40
Keywords: Contextual bandits; Online advertising; Recommender systems; One-to-one marketing; Empirical Bayes
School: E.T.S.I. Industriales (UPM)
Department: Ingeniería de Organización, Administración de Empresas y Estadística
Creative Commons license: Attribution - NoDerivatives - NonCommercial

Full text

PDF - Download (7MB)


Machine and statistical learning techniques are used in almost all online advertisement systems. The problem of discovering which content is most demanded (e.g. receives more clicks) can be modeled as a multi-armed bandit problem. Contextual bandits (i.e., bandits with covariates, side information, or associative reinforcement learning) associate with each specific content several features that define the “context” in which it appears (e.g. user, web page, time, region). This problem can be studied in the stochastic/statistical setting by means of the conditional-probability paradigm using Bayes' theorem. However, for very large contextual information and/or under real-time constraints, the exact calculation of Bayes' rule is computationally infeasible. In this article, we present a method that is able to handle large contextual information for learning in contextual-bandit problems. This method was tested on the Yahoo! dataset challenge at the ICML 2012 workshop “New Challenges for Exploration & Exploitation 3”, obtaining second place. Its basic exploration policy is deterministic in the sense that the same input data (as a time series) always yield the same results. We address the deterministic exploration vs. exploitation issue, explaining how the proposed method deterministically finds an effective dynamic trade-off based solely on the input data, in contrast to other methods that rely on a random number generator.
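To make the setting concrete, the following is a minimal sketch of the contextual-bandit loop the abstract describes: a context arrives, an arm (content item) is chosen by a Bayes-style click model, and the observed click/no-click updates the model. This is an illustrative stand-in only — it uses a simple per-arm naive-Bayes click model with Laplace smoothing, not the paper's actual linear Bayes policy — but like the paper's policy, arm selection is deterministic (a plain argmax, no random number generator). The class name and all parameters are hypothetical.

```python
import numpy as np

class NaiveBayesBandit:
    """Per-arm naive-Bayes click model over binary context features.

    Illustrative sketch only: it shows the contextual-bandit structure
    (context in, arm out, reward update), not the paper's method.
    """

    def __init__(self, n_arms, n_features, prior=1.0):
        # Laplace-smoothed counts per (arm, feature, feature value),
        # kept separately for click and no-click outcomes.
        self.clicks = np.full((n_arms, n_features, 2), prior)
        self.skips = np.full((n_arms, n_features, 2), prior)
        self.arm_clicks = np.full(n_arms, prior)
        self.arm_skips = np.full(n_arms, prior)

    def score(self, context):
        """Posterior click probability of each arm given a 0/1 context vector."""
        context = np.asarray(context)
        idx = np.arange(len(context))
        # log P(click|arm) + sum_f log P(x_f | click, arm), and same for no-click
        log_click = np.log(self.arm_clicks / (self.arm_clicks + self.arm_skips))
        log_click += np.log(self.clicks[:, idx, context] /
                            self.clicks[:, idx, :].sum(axis=2)).sum(axis=1)
        log_skip = np.log(self.arm_skips / (self.arm_clicks + self.arm_skips))
        log_skip += np.log(self.skips[:, idx, context] /
                           self.skips[:, idx, :].sum(axis=2)).sum(axis=1)
        # normalise the two hypotheses per arm (log-sum-exp trick for stability)
        m = np.maximum(log_click, log_skip)
        c, s = np.exp(log_click - m), np.exp(log_skip - m)
        return c / (c + s)

    def select(self, context):
        # Deterministic choice: same context and history -> same arm, no RNG.
        return int(np.argmax(self.score(context)))

    def update(self, arm, context, clicked):
        context = np.asarray(context)
        idx = np.arange(len(context))
        if clicked:
            self.arm_clicks[arm] += 1
            self.clicks[arm, idx, context] += 1
        else:
            self.arm_skips[arm] += 1
            self.skips[arm, idx, context] += 1
```

For realistic problem sizes the naive-Bayes factorisation is exactly the kind of approximation the abstract motivates: the exact joint Bayes rule over all context features is infeasible, so some structured simplification of the conditional probabilities is required.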

More information

Record ID: 26245
DOI: 10.1016/j.eswa.2013.07.041
Deposited by: Memoria Investigacion
Deposited on: 20 Jan 2015 15:29
Last modified: 01 Jan 2016 23:56