As I said before, I’m working on a text classification project. I use doc2vec to convert text into vectors, then I use LPA to classify the vectors.

LPA is a simple, effective semi-supervised algorithm. It can use the density of unlabeled data to find a hyperplane to split the data.

Here are the main stop of the algorithm:

  1. Let $ (x_1,y1)…(x_l,y_l)$ be labeled data, $Y_L = \{y_1…y_l\} $ are the class labels. Let \((x_{l+1},y_{l+u})\) be unlabeled data where \(Y_U = \{y_{l+1}…y_{l+u}\}\) are unobserved, usually \(l \ll u\). Let \(X=\{x_1…x_{l+u}\}\) where \(x_i\in R^D\). The problem is to estimate \(Y_U\) for \(X\) and \(Y_L\).
  2. Calculate the similarity of the data points. The most simple metric is Euclidean distance. Use a parameter \(\sigma\) to control the weights.

\[w_{ij}= exp(-\frac{d^2_{ij}}{\sigma^2})=exp(-\frac{\sum^D_{d=1}{(x^d_i-x^d_j})^2}{\sigma^2})\]

Larger weight allow labels to travel through easier.

  1. Define a \((l+u)*(l+u)\) probabilistic transition matrix \(T\)

\[T_{ij}=P(j \rightarrow i)=\frac{w_{ij}}{\sum^{l+u}_{k=1}w_{kj}}\]

\(T_{ij}\) is the probability to jump from node \(j\) to \(i\). If there are \(C\) classes, we can define a \((l+u)*C\) label matrix \(Y\), to represent the probability of a label belong to class \(c\). The initialization of unlabeled data points is not important.

  1. Propagate \(Y \leftarrow TY\)
  2. Row-normalize Y.
  3. Reset labeled data’s Y. Repeat 3 until Y converges.

In short, let the nearest label has larger weight, then calculate each label’s new label, reset labeled data’s label, repeat.


  1. Learning from Labeled and Unlabeled Data with Label Propagation
  2. 标签传播算法(Label Propagation)及Python实现