Probabilistic Archetypal Analysis

Abstract

Given a set of observations, archetypal analysis finds ‘extreme’ examples, i.e., archetypes, that represent the observations well. Following the geometric formulation proposed by Cutler and Breiman (1994), this is achieved by approximating the convex hull of the observations with the archetypes, so that each observation can be explained as a convex combination of the archetypes; an analogy is the way red, green and blue explain the color spectrum as convex combinations of these archetypal colors. Archetypal analysis can be seen as a matrix factorization problem, and is closely related to other ‘prototype’-finding approaches, e.g., k-means clustering and topic modelling. The standard approach to finding archetypes assumes that the observations are real-valued, which, unfortunately, is not the case in many practical situations. For example, one may want to find archetypal responses to a set of binary questions, or archetypal documents given the word count vectors of a set of documents. In this talk, I will revisit archetypal analysis from basic principles and discuss a probabilistic framework that accommodates these scenarios, i.e., data types such as integers, categorical variables, and stochastic vectors. This formulation is equivalent to performing archetypal analysis in the continuous parameter space of the probability distribution rather than in the discrete observation space, and for a range of exponential family distributions, such as Bernoulli, Poisson, and multinomial, the resulting optimization problem can be solved efficiently using majorization-minimization. For categorical variables, e.g., multiple-option questions, I will introduce an extension of this approach to a generative framework that places a Dirichlet prior over the mixing parameters; the approximate posterior distribution can be inferred efficiently using variational Bayes, and the associated hyperparameters help to find a suitable number of archetypes.
I will show the application of these formulations for finding archetypal tourists based on binary survey data, archetypal disaster-affected countries based on disaster count data, archetypal customers using German credit data, archetypal images using SUN image attribute data, and archetypal behaviour from Big Five personality data. I will also present an appropriate visualization tool to summarize the archetypal analysis solution, and address some recent developments in this area and some open questions.
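
To make the matrix factorization view in the abstract concrete, the following is a minimal sketch of the standard (least-squares) archetypal analysis for real-valued data, not the probabilistic framework presented in the talk. It assumes observations are stored as rows of a matrix X and minimizes ||X - A B X||^2 (Frobenius norm), where the rows of A and B lie on the probability simplex, so the archetypes Z = B X are convex combinations of observations and each observation is approximated by a convex combination of archetypes. The function names, the parameter names, and the use of alternating projected gradient steps (in place of the majorization-minimization scheme mentioned in the abstract) are illustrative assumptions.

```python
import numpy as np


def project_rows_to_simplex(M):
    """Euclidean projection of each row of M onto the probability simplex."""
    n, k = M.shape
    U = np.sort(M, axis=1)[:, ::-1]              # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0             # cumulative sums minus the target sum 1
    idx = np.arange(1, k + 1)
    cond = U - css / idx > 0                     # true for the first rho entries of a row
    rho = k - np.argmax(cond[:, ::-1], axis=1)   # largest 1-based index where cond holds
    theta = css[np.arange(n), rho - 1] / rho     # per-row shift
    return np.maximum(M - theta[:, None], 0.0)


def archetypal_analysis(X, n_archetypes, n_iter=500, seed=0):
    """Hypothetical helper: least-squares archetypal analysis by alternating
    projected gradient steps on A and B. Returns (A, B, Z) with Z = B @ X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random starting points on the simplex (an assumption; other inits are possible).
    A = project_rows_to_simplex(rng.random((n, n_archetypes)))
    B = project_rows_to_simplex(rng.random((n_archetypes, n)))
    x_norm_sq = np.linalg.norm(X, 2) ** 2        # spectral norm of X, squared

    for _ in range(n_iter):
        # Update A with archetypes fixed: gradient of 0.5 * ||X - A Z||^2 w.r.t. A.
        Z = B @ X
        grad_A = (A @ Z - X) @ Z.T
        step_A = 1.0 / (np.linalg.norm(Z @ Z.T, 2) + 1e-12)
        A = project_rows_to_simplex(A - step_A * grad_A)

        # Update B with A fixed: gradient of 0.5 * ||X - A B X||^2 w.r.t. B.
        grad_B = A.T @ (A @ B @ X - X) @ X.T
        step_B = 1.0 / (np.linalg.norm(A.T @ A, 2) * x_norm_sq + 1e-12)
        B = project_rows_to_simplex(B - step_B * grad_B)

    return A, B, B @ X
```

Calling `archetypal_analysis(X, 3)` on an array X of shape (n, d) returns mixing weights A of shape (n, 3), archetype weights B of shape (3, n), and archetypes Z of shape (3, d). The step sizes are the inverse Lipschitz constants of the two block gradients, which keeps each update from increasing the objective; this is a simple choice for illustration rather than a tuned solver.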

Date
Sep 11, 2019 10:00 AM — 10:45 AM
Sohan Seth

Lead Data Scientist (Senior Research Fellow equivalent) at the School of Informatics, University of Edinburgh.