Low rank approximation in Mixture Models and Recommendation systems¶
Low rank approximation in Mixture Models¶
A mixture model is a probabilistic model in which the distribution is a sum of different probability densities, each with an associated weight. In short, a mixture model is a weighted sum of probability densities, \( \sum_{i=1}^{K} w_i\,p_i(x) \), where:
\(K\) is the number of distributions present in the mixture model
\( p_i\) is the probability density of the \(i^{th}\) distribution
\( w_i\) is the weight associated with the \(i^{th}\) distribution
In this section, we will only consider Gaussian mixture models in which each of the underlying probability densities is a spherical Gaussian.
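As a concrete illustration, sampling from such a mixture amounts to first picking a component \(i\) with probability \(w_i\) and then drawing from \(N(\mu_i,\sigma^{2}I)\). The sketch below uses made-up weights, centers, and \(\sigma\):

```python
import numpy as np

# Sampling from a spherical Gaussian mixture: pick component i with
# probability w_i, then draw from N(mu_i, sigma^2 I).
# The weights, centers and sigma are made-up illustrative values.
rng = np.random.default_rng(0)
d, sigma = 10, 1.0
w = np.array([0.5, 0.3, 0.2])        # mixture weights w_i, sum to 1
mus = rng.normal(size=(3, d))        # one center mu_i per component

def sample_mixture(n):
    comps = rng.choice(len(w), size=n, p=w)              # which p_i each sample comes from
    return mus[comps] + sigma * rng.normal(size=(n, d))  # spherical Gaussian around mu_i

samples = sample_mixture(1000)
print(samples.shape)                 # (1000, 10)
```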
Clustering of Gaussian mixture models¶
We have previously seen that in a Gaussian mixture, if the distance between centers is \( >cd^{1/4}\), then clustering algorithms will be able to identify points coming from the same distribution in the mixture.
d is the dimension of the data
In higher dimensions, this separation is usually not met and clustering algorithms perform poorly. Now, hypothetically, if we are given the subspace spanned by the K centers, then we can project all the sample points onto this subspace.
Important
The distances between centers remain the same because we are projecting onto the subspace which is spanned by the centers
The dimension of the data has decreased from d to k
Typically \( k \ll d \), and therefore the minimum distance between centers needed to distinguish their distributions has decreased (the separation condition now depends on \(k\) rather than \(d\)). This is beneficial for our clustering algorithms and will give us better performance.
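A quick numerical illustration of the two points above (with arbitrary made-up \(d\), \(k\), and random centers): projecting onto an orthonormal basis of the span of the centers keeps the pairwise distances between centers unchanged while reducing the dimension from \(d\) to \(k\).

```python
import numpy as np

# Distances between centers are preserved when projecting onto the
# subspace spanned by the centers (arbitrary illustrative d and k).
rng = np.random.default_rng(1)
d, k = 100, 4
centers = rng.normal(size=(k, d))

# Orthonormal basis of the span of the centers (columns of P, shape d x k)
P, _ = np.linalg.qr(centers.T)

projected = centers @ P                          # centers in k coordinates
before = np.linalg.norm(centers[0] - centers[1])
after = np.linalg.norm(projected[0] - projected[1])
print(np.isclose(before, after))                 # True
```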
Finding subspace spanned by the centers¶
Consider the distribution \( D \equiv N(\mu,\sigma^{2}I) \) with mean \( \mu \in \mathbb{R}^{d} \)
Claim
For a unit vector \({\overrightarrow{v}}\), \(E[{(v^Ta)}^2]\), where \( a\sim D\), is maximised when \({\overrightarrow{v}}\) is \(\frac{\overrightarrow{\mu}}{\lVert \mu \rVert} \)
Proof. \(E[{(v^Ta)}^2] = E[{(v^T(\mu+z))}^2] = E[{(v^T\mu + v^Tz)}^2], z\sim N(0,\sigma^{2}I)\)
\(= {(v^T\mu)}^2 + 2E[(v^T\mu v^Tz)] + v^TE[zz^T]v \)
\(={(v^T\mu)}^2 + 0 + v^T{({\sigma}^2I)}v \)
\(={(v^T\mu)}^2 + {\sigma}^2v^Tv = {(v^T\mu)}^2 + {\sigma}^2\), since \(v\) is a unit vector
This is maximised when v is along \(\frac{\overrightarrow{\mu}}{\lVert \mu \rVert} \)
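A small Monte Carlo check of the claim (with made-up \(\mu\), \(\sigma\), and sample size): the empirical value of \(E[{(v^Ta)}^2]\) matches \({(v^T\mu)}^2 + {\sigma}^2\) and is largest when \(v\) points along \(\mu\).

```python
import numpy as np

# Monte Carlo check: for a unit vector v, E[(v^T a)^2] = (v^T mu)^2 + sigma^2,
# and it is largest when v points along mu. Parameters are made-up.
rng = np.random.default_rng(2)
d, sigma, n = 20, 2.0, 200_000
mu = rng.normal(size=d)
a = mu + sigma * rng.normal(size=(n, d))     # rows are samples from N(mu, sigma^2 I)

def empirical(v):
    return np.mean((a @ v) ** 2)

v_mu = mu / np.linalg.norm(mu)               # unit vector along mu
v_rand = rng.normal(size=d)
v_rand /= np.linalg.norm(v_rand)             # a random unit vector

print(empirical(v_mu), np.dot(v_mu, mu) ** 2 + sigma ** 2)  # approximately equal
print(empirical(v_rand) < empirical(v_mu))                   # True (with high probability)
```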
Before moving on to Gaussian mixture models, let's try solving the simpler version first.
Simple case: Single-Gaussian¶
We are given a single Gaussian distribution \(D \equiv N(\mu,\sigma^{2}I)\), and let \( a_1,a_2,a_3,...,a_n\) be \(n\) points from this distribution. We need to find the unit vector \(v\) which maximises \( \sum_{i=1}^{n} {(v^Ta_i)}^2\)
Converting the summation into matrix multiplication we get the following
\( \sum_{i=1}^{n} {(v^Ta_i)}^2 = {{{\lVert Av \rVert}}_2}^2\), where \(A= \left(\begin{array}{c} {a_1}^T\\ {a_2}^T\\ \vdots \\ {a_n}^T \end{array}\right)\)
Important
We have seen \({{{\lVert Av \rVert}}_2}^2\) before in the SVD, and the unit vector \(v\) which maximises this norm is nothing but the first right singular vector of \(A\).
We know that \(E[{(v^Ta)}^2] \approx \frac{1}{n}\sum_{i=1}^{n} {(v^Ta_i)}^2\) for large n.
So, \(\underset{v}{\operatorname{\arg max}} {{{\lVert Av \rVert}}_2}^2\) approximately lies along \({\overrightarrow{\mu}}\), and from the previous claim we can say that the unit vector along \(\mu\) is well approximated by the first right singular vector of \(A\).
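A minimal numerical check of this (the dimension, \(\mu\), \(\sigma\), and number of samples are made-up): for large \(n\), the first right singular vector of \(A\) lines up with \(\mu/\lVert \mu \rVert\).

```python
import numpy as np

# Estimate the direction of mu for a single spherical Gaussian using the
# first right singular vector of the sample matrix A. Parameters are made-up.
rng = np.random.default_rng(0)
d, n, sigma = 50, 5000, 1.0
mu = rng.normal(size=d)

A = mu + sigma * rng.normal(size=(n, d))   # rows of A are the samples a_i^T

# First right singular vector of A (the unit v maximising ||Av||_2^2)
_, _, Vt = np.linalg.svd(A, full_matrices=False)
v = Vt[0]

mu_hat = mu / np.linalg.norm(mu)
print(abs(v @ mu_hat))                     # close to 1 (sign of v is arbitrary)
```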
Gaussian mixture model¶
The same concepts from the single-Gaussian case apply here too, but the geometry gets trickier as we are dealing with \(d\) dimensions. Instead of a line passing through a single center \(\mu\), here we will deal with a subspace passing through all the centers \(\mu_1,\mu_2,\mu_3,...,\mu_k\)
Applying the same idea from the single-Gaussian case:
The column space of \(\underset{P}{\operatorname{\arg max }}\) \(E_a[{{{\lVert a^TP \rVert}}_2}^2]\) is the subspace spanned by \({[\mu_1,\mu_2,...,\mu_k]}\), where \(a\) comes from the Gaussian mixture model and \(P\) is of dimension (\(d,k\))
Here \( P = {[p_1,p_2,...,p_k]}\), where the \(p_i\) are orthonormal, i.e. \(p_i \perp p_j\) for \( i \neq j\) and \(\lVert p_i \rVert = 1\)
We know that \(E_a[{{{\lVert a^TP \rVert}}_2}^2] \approx \frac{1}{n}\sum_{i=1}^{n} {{\lVert {a_i}^TP \rVert}_2}^2\) for large \(n\).
So to find the subspace spanned by \({[\mu_1,\mu_2,...,\mu_k]}\) we need to find \(\underset{P}{\operatorname{\arg max }}\) \({{{\lVert AP \rVert}}_F}^2\)
If we can find such a \(P\), then we can project all our sample points onto its column space by computing \({a_i}^TP\)
Note
From the Pythagorean theorem, \({{{\lVert A \rVert}}_F}^2 = {{{\lVert APP^T \rVert}}_F}^2 + {{{\lVert A-APP^T \rVert}}_F}^2\) and \({{\lVert AP \rVert}}_F = {{\lVert APP^T \rVert}}_F\), so \(\underset{P}{\operatorname{\arg max }}\) \({{{\lVert AP \rVert}}_F}^2 = \underset{P}{\operatorname{\arg min }} {{{\lVert A-APP^T \rVert}}_F}^2\)
Note
Property of the low rank approximation: \({{{\lVert A-A_k \rVert}}_F} \leq {{{\lVert A-B \rVert}}_F}\) for any \(B\) with \(rank(B) \leq k\)
Therefore \( {{{\lVert A-APP^T \rVert}}_F}^2\) attains its minimum when \(APP^T = A_k\), i.e. when the columns of \(P\) are the top \(k\) right singular vectors of \(A\)
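As a sanity check, here is a small NumPy snippet (with an arbitrary random \(A\) and a made-up \(k\)) verifying that taking the columns of \(P\) to be the top \(k\) right singular vectors gives \(APP^T = A_k\).

```python
import numpy as np

# Check that projecting the rows of A onto the span of the top-k right
# singular vectors recovers A_k exactly, i.e. A P P^T = A_k.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 30))
k = 5

U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * S[:k] @ Vt[:k]            # best rank-k approximation of A
P = Vt[:k].T                               # d x k matrix with orthonormal columns

print(np.allclose(A @ P @ P.T, A_k))       # True
print(np.linalg.norm(A - A @ P @ P.T))     # equals ||A - A_k||_F
```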
Important
Now back to our initial idea of projecting the sample points onto the subspace spanned by the centers, i.e. each point \(a_i\) gets mapped to \({a_i}^TPP^T\); since \(P\) satisfies the above equation, this projection is nothing but the \(i^{th}\) row of \(A_k\)
So our final strategy for learning Gaussian mixtures is as follows:
Given \(n\) samples \(a_1,a_2,...,a_n\), \(a_i \in \mathbb{R}^{d}\), form the matrix \(A= \left(\begin{array}{c} {a_1}^T\\ {a_2}^T\\ \vdots \\ {a_n}^T \end{array}\right)\)
Find \( A_k \), the best rank-\(k\) approximation of \(A\) (via the SVD)
Apply clustering algorithms on the rows of \(A_k\)
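Putting the three steps together, a possible sketch on synthetic data is shown below. All parameters (dimensions, separation, sample sizes) are made-up, and scikit-learn's KMeans is assumed to be available for the clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sample from a spherical Gaussian mixture, form A, compute the rank-k
# approximation A_k by SVD, and cluster the rows of A_k.
rng = np.random.default_rng(2)
d, k, n_per = 100, 3, 500
centers = 6.0 * rng.normal(size=(k, d))                 # the unknown mu_1, ..., mu_k

A = np.vstack([c + rng.normal(size=(n_per, d)) for c in centers])

# Step 2: best rank-k approximation A_k
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * S[:k] @ Vt[:k]

# Step 3: cluster the rows of A_k instead of the raw points
labels = KMeans(n_clusters=k, n_init=10).fit_predict(A_k)
print(labels[:10])
```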
Recommendation systems¶
We are given a rating matrix \(R\) with \(n\) rows and \(m\) columns, where each row denotes a user and each column denotes an object. If user \(u\) rates object \(o\) with a rating \(x\), then \(R[u][o] = x\). It is not necessary that every user rates every object.
Goal
To fill in the missing entries of the rating matrix \(R\)
We assume that there is an underlying space of interests, called the latent space.
Preliminary approach¶
Each user and each object is a point in this latent space. Now let's say the rating of user \(u\) for object \(o\), i.e. \(R[u][o]\), depends on the closeness of \(u\) and \(o\) in the latent space.
SVD approach¶
Do one of the following to \(R\)
Fill all the empty cells in \(R\) with zeroes.
Normalise \(R\) by subtracting its mean and fill the empty cells with 0.
Fill an empty cell with its row average or its column average
Find \(R_k = P_k \Sigma_k Q_k\), the low rank approximation of \(R\)
We fill \(R\) by recombining features from the low rank approximation \(R_k\) as \(R_{uo} = \sum_{i=1}^{k} P_{ui} \sigma_i Q_{io} \)
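A sketch of this procedure on a toy rating matrix (the matrix, the missing-value marker, and \(k\) are made-up; missing cells are filled with the row average before taking the SVD):

```python
import numpy as np

# SVD approach: fill missing cells with the row average, take the rank-k SVD
# R_k = P_k Sigma_k Q_k, and read the predicted ratings off R_k.
R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, 5.0, np.nan],
              [np.nan, 1.0, 4.0, 5.0]])
k = 2

row_means = np.nanmean(R, axis=1, keepdims=True)    # average over known ratings
R_filled = np.where(np.isnan(R), row_means, R)

P, sigma, Q = np.linalg.svd(R_filled, full_matrices=False)
R_k = P[:, :k] * sigma[:k] @ Q[:k]

# Predicted rating of user u for object o is R_k[u, o]
print(np.round(R_k, 2))
```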
Remark
There is no specific way of choosing \(k\)
One way to go about it is to plot the \({\sigma_i}\)'s and choose \(k\) based on the \(\sigma\) values: if the \({\sigma_i}\)'s after \(i = t \) do not contribute significantly to the variation in the data, then choose \( k = t\)
Another way of approaching the problem is to treat the missing entries as unknown entries.
Then the goal is to find \(B\), with \(rank(B) = k\), that minimises \( \sum_{(u,o) \in \Omega} {(R_{uo} - B_{uo})}^2 \), where \(\Omega\) is the set of (user, object) pairs for which the rating is known.
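One possible sketch of this formulation, using plain gradient descent on two factors so that \(B = PQ\) has rank \(k\) and fits only the known entries (the toy \(R\), the rank, the step size, and the iteration count are all made-up):

```python
import numpy as np

# Find a rank-k matrix B = P Q that minimises the squared error on the
# known entries of R only. All parameters are made-up illustrative values.
R = np.array([[5.0, 4.0, np.nan, 1.0],
              [4.0, np.nan, np.nan, 1.0],
              [1.0, 1.0, 5.0, np.nan],
              [np.nan, 1.0, 4.0, 5.0]])
k, lr, iters = 2, 0.01, 5000
known = ~np.isnan(R)                      # the set Omega of observed (u, o) pairs
R0 = np.nan_to_num(R)                     # unknown entries contribute nothing below

rng = np.random.default_rng(3)
P = 0.1 * rng.normal(size=(R.shape[0], k))
Q = 0.1 * rng.normal(size=(k, R.shape[1]))

for _ in range(iters):
    err = known * (R0 - P @ Q)            # residual on the known entries only
    P += lr * err @ Q.T                   # gradient step for P
    Q += lr * P.T @ err                   # gradient step for Q

B = P @ Q                                 # rank-k completion of R
print(np.round(B, 2))
```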