## Introduction to Nonnegative Matrix Factorization

for the Data Science Reading Group meetup July 5, 2017

## General comments

This paper mostly did what I’d hoped: give a recent overview of the field of nonnegative matrix factorization (NMF), with lots of links to other work for those who want to dig deeper.

It’s a bit math-y though and so maybe not to everyone’s taste. In these notes I’ll try to provide a gentler introduction and some of my own links that I found along the way.

Is it worth the trouble? I think so. It is a very widely used technique and many machine learning papers assume that you know what NMF is. And at least a couple of people (including François Chollet, creator of Keras) picked it as their favorite ML algorithm in this Quora question.

## An example: topic modeling

The paper starts from the general and works toward the specific. We’ll start with an example. This is a different example from the one in the paper, but I think it’s easier to understand as a starting point. The notation is a little different from that of the paper. There doesn’t seem to be a standard, and the paper’s notation is not the most common.

The following borrows heavily from this lecture, which is a model of clarity of exposition, by the way, and is highly recommended.

Let’s suppose we have a collection of news articles and we want to group them by topic. All the sports stories go into one group, the personal finance articles into another, etc. How might we do that?

We’ll use a two-level linear mixing model. First, we’ll assume that each article is a combination of topics. For example, it might be 70% about personal finance, 20% about politics, and 10% about education. Second, we’ll assume that topics are a distribution over words, that is, given a topic, some words are more likely to occur than others. For example, an article containing the words “RRSP” and “savings” suggests one of the topics is personal finance.

We don’t know what the topics are and we don’t know what the relation between words and topics is. NMF is a way to discover those things for us.
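The two-level mixing model can be sketched numerically. This is a toy example with made-up numbers (the vocabulary, topics, and weights are all hypothetical): each column of \(A\) is a topic, i.e., a probability distribution over words, and a document's word distribution is the mixture \(Aw\) of its topic weights.

```python
import numpy as np

# Hypothetical toy vocabulary of 4 words and 2 topics.
# Each column of A is a topic: a probability distribution over words.
A = np.array([
    [0.5, 0.0],   # "RRSP"
    [0.4, 0.1],   # "savings"
    [0.1, 0.3],   # "vote"
    [0.0, 0.6],   # "election"
])

# A document that is 70% personal finance, 30% politics.
w = np.array([0.7, 0.3])

# The document's expected word distribution is the mixture A @ w.
doc = A @ w
print(doc)        # word probabilities for this document
print(doc.sum())  # a mixture of distributions is itself a distribution
```

Note that because the columns of `A` sum to 1 and the weights in `w` sum to 1, the mixture `doc` automatically sums to 1 as well.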

To formalize this, we form the word-by-document matrix, \(M\) of size \(m\times n\) , where \(m\) = word index, and \(n\) = document index. Each column of \(M\) corresponds to a document. Each entry \(m_{ij}\) is the relative frequency of the word \(i\) in document \(j\). The term “relative frequency” means we have normalized the word counts so that the columns of \(M\) sum to 1. \(M\) is a very large, very sparse matrix (i.e., most of the \(m_{ij}\) are zero).
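Here is a minimal sketch of constructing \(M\) from raw text, assuming scikit-learn is available; the mini-corpus is obviously hypothetical, and a real collection would be thousands of articles with a sparse matrix kept in sparse form.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus of three "articles".
docs = [
    "rrsp savings rrsp retirement",
    "election vote election campaign",
    "savings vote budget",
]

counts = CountVectorizer().fit_transform(docs)  # documents x words
M = counts.T.toarray().astype(float)            # transpose to words x documents
M /= M.sum(axis=0, keepdims=True)               # normalize: columns sum to 1
print(M.shape)                                  # (m words, n documents)
print(M.sum(axis=0))                            # each column sums to 1
```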

We now want to capture the idea that documents are linear combinations of topics and topics are linear combinations of words. We’ll fix the number of topics to \(r\), which is some number that we decided is reasonable. We look for an \(m\times r\) matrix \(A\) and an \(r\times n\) matrix \(W\) for which \(M=AW\). \(A\) captures the distribution of words over topics. \(W\) captures the representation of documents by topics.

Because we want to interpret the entries of \(A\) and \(W\) as probabilities, each element must be nonnegative, i.e., \(\ge0\). The mathematical notation is \(A\in\mathbb{R}_{+}^{m\times r}\), \(W\in\mathbb{R}_{+}^{r\times n}\). We would also like the columns of \(A\) and \(W\) to sum to 1. It turns out that if we can find \(A\) and \(W\) such that \(M=AW\), then we can fiddle them so that their columns sum to 1. (The paper shows how.)
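One half of that "fiddle" is easy to sketch: rescale the columns of \(A\) by their sums and absorb the inverse scaling into \(W\), so the product \(AW\) is unchanged. (The paper's full argument also covers the columns of \(W\); this is just the \(A\) part, on random data rather than a real factorization.)

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 2))   # nonnegative, m x r
W = rng.random((2, 3))   # nonnegative, r x n

d = A.sum(axis=0)        # column sums of A
A2 = A / d               # A D^{-1}: each column of A2 now sums to 1
W2 = d[:, None] * W      # D W: compensating scaling keeps the product unchanged

assert np.allclose(A @ W, A2 @ W2)       # same factorization
assert np.allclose(A2.sum(axis=0), 1.0)  # columns of A2 are distributions
```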

The hard part is finding nonnegative \(A\) and \(W\), and in general there aren’t any choices that give us exactly \(M=AW\). We have to settle for finding \(A,W\) that get us close to \(M\). “Close” can be defined in various ways, but it’s typical to minimize the squared difference between the entries of \(M\) and \(AW\),

$$\left\Vert M-AW\right\Vert _{F}^{2}=\sum_{i,j}\left((M)_{ij}-(AW)_{ij}\right)^{2}$$

where \(\left\Vert \cdot\right\Vert _{F}\) is known as the *Frobenius norm*.
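The identity above is worth seeing in code: the squared Frobenius norm of the residual \(M-AW\) is nothing more than the sum of its squared entries. A quick check on random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.random((4, 6))
A = rng.random((4, 2))
W = rng.random((2, 6))

R = M - A @ W
# Squared Frobenius norm == sum of squared entries.
assert np.isclose(np.linalg.norm(R, "fro") ** 2, (R ** 2).sum())
```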

To summarize, the goal of NMF is to find matrices \(A,W\) that solve

$$\min_{A\in\mathbb{R}_{+}^{m\times r},W\in\mathbb{R}_{+}^{r\times n}}\left\Vert M-AW\right\Vert _{F}^{2}$$
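This minimization is a one-liner in scikit-learn. A minimal sketch on synthetic data (the matrix sizes and parameters here are arbitrary choices, not from the paper):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Synthetic data with an underlying rank-3 nonnegative structure plus noise.
M = rng.random((20, 3)) @ rng.random((3, 30)) + 0.01 * rng.random((20, 30))

model = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
A = model.fit_transform(M)   # m x r factor
W = model.components_        # r x n factor

err = np.linalg.norm(M - A @ W, "fro") ** 2
assert A.min() >= 0 and W.min() >= 0   # both factors are nonnegative
print(err)                             # small relative to ||M||_F^2
```

Note that scikit-learn calls the two factors `W` and `H` rather than `A` and `W`; the shapes above follow this post's notation.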

NMF is an example of *dimensionality reduction* and *unsupervised learning*.

## Other examples

It seems that matrix factorization is eating the world. The range of problems it gets applied to is vast. They share some key characteristics: there are a large number of observations which depend on a relatively small number of unknown factors, and there is a two-level probabilistic model. The differences come from the specifics of the observations and the techniques used to find the matrix factors, which is in general a very hard problem. Here are a few examples.

**Recommendation systems** There’s no one killer app for NMF, but some of its most prominent successes are recommendation systems. This is the thing you see on shopping sites, etc.: “Since you liked that, maybe you’ll like this.” The general approach is known as *collaborative filtering*. For a TV recommender, the observations might be: which shows were watched, for how long, etc. The paper Collaborative Filtering for Implicit Feedback Datasets that was proposed for the reading group is a nice example.

**Blind source separation** In signal processing applications you often have a signal that is a mixture of component signals and you want to separate them. The signals might be speech and music, or multiple speakers, or the notes of a piano chord. This presentation by Paris Smaragdis gives a very clear exposition of the concepts for a music application.

**Hyperspectral imaging** This is the example given in the paper. You have imagery (for example, from a satellite) that was gathered in multiple spectral bands. The model is that each pixel results from a mixture of substances on the ground (soil, water, vegetation, …). You want to identify those components and how much they contribute at each point.

**Face decomposition** This is one of the first applications of NMF, but I have to admit, I don’t really get it. True, it does give you compelling pictures. But NMF doesn’t take advantage of the spatial coherence of the pixels, so to apply it you have to use a set of images that are very well aligned with each other. If you already know where the faces are, what is this doing for you?

## Related techniques

The literature seems full of “such-and-such just turns out to be equivalent to matrix factorization” and vice versa. Some examples that I’d like to understand better (and not all of these are nonnegative, but whatever) are listed below. (The items in quotes are from the Wikipedia article on NMF.)

- **Autoencoders** — the lecture by Smaragdis cited above has a nice explanation of formulating NMF as an autoencoder, which he goes on to show is more convenient to extend.
- “NMF is identical to probabilistic **latent semantic analysis**”
- **\(k\)-means** and other **clustering techniques**
- “**SVM** and NMF are related at a more intimate level”
- **Principal components analysis** (PCA) is a very commonly used dimensionality-reduction technique. It is kind of like the not-nonnegative version of NMF and is readily solvable using singular value decomposition (SVD).
- **GloVe**, an alternative to word2vec for finding word embeddings
- …

## Python library routines

- `scikit-learn` uses alternating least-squares minimization, which is maybe not the best
- Nimfa seems to have a wider variety of algorithms and is actively being developed

## The paper

A few comments by section.

- **Introduction** — The paper starts with a nice summary of constrained low-rank matrix approximation (CLRMA), which generalizes NMF. It puts things into context and is a good starting point if you want to go beyond NMF.
- **Nonnegative Matrix Factorization** — Quick summary of what NMF is.
- **Hyperspectral Imaging** — This is very nicely set up, but then the author leaves us dangling without presenting any results! The reason is that NMF out of the box doesn’t really solve the problem; it needs additional constraints. This seems to be a common theme in many applications of NMF.
- **Geometry and Uniqueness** — *tl;dr: the solution is not unique.* The geometrical interpretation would need some digging into to really be understood, but the general idea is important. For a factorization \(M=UV\), the columns of \(M\) are convex combinations (i.e., weighted averages) of the columns of \(U\). The author uses this to give an example showing that even in the exact case, the decomposition into \(U\) and \(V\) is not unique. It’s not clear how big a problem this is in practice.
- **Complexity** — *tl;dr: it’s hard, like NP-hard.* So heuristic techniques are used.
- **Algorithms** — People generally use an iterative scheme that bounces back and forth between minimizing for \(U\) and then minimizing for \(V\). As an alternative, if you are willing to make additional assumptions about your data (separable NMF), then you can calculate things much faster and with guarantees. See the lecture by Ankur Moitra for a very clear explanation of how this works.
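The back-and-forth scheme can be sketched with the classic Lee–Seung multiplicative updates, which alternately improve \(V\) with \(U\) fixed and \(U\) with \(V\) fixed while preserving nonnegativity. This is one illustrative variant of the iterative approach, not the specific algorithm the paper recommends:

```python
import numpy as np

def nmf_mu(M, r, iters=2000, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min ||M - U V||_F^2, U, V >= 0."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.random((m, r))
    V = rng.random((r, n))
    for _ in range(iters):
        V *= (U.T @ M) / (U.T @ U @ V + eps)   # improve V with U fixed
        U *= (M @ V.T) / (U @ V @ V.T + eps)   # improve U with V fixed
    return U, V

rng = np.random.default_rng(1)
M = rng.random((10, 4)) @ rng.random((4, 12))  # exactly rank-4 and nonnegative
U, V = nmf_mu(M, 4)
print(np.linalg.norm(M - U @ V, "fro"))        # residual shrinks with iterations
```

Because the updates only ever multiply by nonnegative ratios, `U` and `V` stay nonnegative throughout; the price is slow convergence compared to more modern solvers.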

I would have been a little happier if the author had summarized how to go about picking a library routine. For example, scikit-learn seems to just implement alternating least squares, whereas Nimfa also offers algorithms that minimize Kullback–Leibler divergence, which some practitioners prefer.

- **Nonnegative Rank and Extended Formulations** — I skipped this. 🙂


Bruce, this is an awesome summary! Thank you!


Thanks for the lecture link. Great summary as well!