All Posts

Circular Import in Python

Recently I found a really good code example of a circular import in Python, and I’d like to record it here. Here is the code. Module X:

    def X1():
        return "x1"

    from Y import Y2

    def X2():
        return "x2"

Module Y:

    def Y1():
        return "y1"

    from X import X1

    def Y2():
        return "y2"

Guess what will happen if you run python X.
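One way to check the answer is to write both modules into a temporary directory and run them in a subprocess. The sketch below is only an illustration: the helper names `run_python` and `run_demo` are hypothetical, and it assumes the two files are saved as `X.py` and `Y.py` (the file names are inferred from the import statements).

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Source of the two modules from the example above.
X_SOURCE = textwrap.dedent('''
    def X1():
        return "x1"

    from Y import Y2

    def X2():
        return "x2"
''')

Y_SOURCE = textwrap.dedent('''
    def Y1():
        return "y1"

    from X import X1

    def Y2():
        return "y2"
''')


def run_python(directory, *args):
    """Run the current interpreter inside `directory` and capture its output."""
    return subprocess.run(
        [sys.executable, *args],
        cwd=directory,
        capture_output=True,
        text=True,
    )


def run_demo():
    """Return (result of running X.py as a script, result of importing X)."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "X.py").write_text(X_SOURCE)
        Path(tmp, "Y.py").write_text(Y_SOURCE)
        as_script = run_python(tmp, "X.py")
        as_module = run_python(tmp, "-c", "import X; print(X.X2())")
    return as_script, as_module


if __name__ == "__main__":
    as_script, as_module = run_demo()
    print("script exit code:", as_script.returncode)
    print("import exit code:", as_module.returncode)
```

Run as a script, the file executes under the name `__main__`, so when Y does `from X import X1` the interpreter loads X.py a second time as module `X`; that second copy then asks for `Y2` before Y has defined it, and the run ends with an `ImportError`. Importing X from another program does not hit this, because by the time Y asks for `X1` it already exists on the partially initialized X.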

Python Dictionary Implementation

Overview: CPython allocates a hash table to store a dictionary. The initial table size is 8, and each slot holds an entry of the form <hash, key, value> (the slot layout changed starting with Python 3.6). When a new key is added, CPython computes i = hash(key) & mask, where mask = table_size - 1, to decide which slot the entry should go in. If that slot is occupied, CPython uses a probing algorithm to find an empty slot for the new item.
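The slot calculation and probing can be sketched in a few lines of Python. This is a toy model, not CPython's implementation: `probe_sequence` and `insert` are hypothetical names, the table never resizes, and the probing recurrence is modeled on the one described in the comments of CPython's `dictobject.c`, whose exact details may differ between versions.

```python
PERTURB_SHIFT = 5  # shift used by the recurrence sketched here


def probe_sequence(key_hash, table_size):
    """Yield candidate slot indices for a hash; table_size must be a power of two."""
    mask = table_size - 1
    perturb = key_hash & 0xFFFFFFFFFFFFFFFF  # treat the hash as unsigned
    i = perturb & mask                       # first slot: hash(key) & mask
    while True:
        yield i
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask


def insert(table, key, value):
    """Store (hash, key, value) in the first free or matching slot.

    Assumes the table always has at least one free slot (no resizing here).
    """
    key_hash = hash(key)
    for i in probe_sequence(key_hash, len(table)):
        entry = table[i]
        if entry is None or (entry[0] == key_hash and entry[1] == key):
            table[i] = (key_hash, key, value)
            return i


if __name__ == "__main__":
    table = [None] * 8
    # 1, 9 and 17 all map to the same first slot when the mask is 7.
    for key in (1, 9, 17):
        print(key, "-> slot", insert(table, key, str(key)))
```

Because small integers hash to themselves in CPython, the keys 1, 9 and 17 all start at the same slot and are pushed to different slots by the probing step, which makes the collision handling easy to observe.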

TextCNN with Pytorch and Torchtext on Colab

PyTorch is a really powerful framework for building machine learning models. Although some features are missing compared with TensorFlow (for example, early stopping and a History object for drawing plots), its code style is more intuitive. Torchtext is an NLP package also made by the PyTorch team. It provides a way to read, process, and iterate over text. Google Colab is a Jupyter notebook environment hosted by Google, where you can use a free GPU or TPU to run your model.

CSRF in Django

CSRF (cross-site request forgery) is a way to forge a user’s request to a target website. For example, malicious website A has a button that, when clicked, sends a logout request to website B. When the user clicks this button, they are logged out of website B without realizing it. Logging out is not a big problem, but a malicious website can forge more dangerous requests, such as a money transfer. Django CSRF protection: Each web framework takes a different approach to CSRF protection.
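To make the idea of CSRF protection concrete, here is a minimal sketch of the generic token pattern: the server stores a random token in the user's session, embeds it in its own forms, and rejects any state-changing request that does not send the same token back. This is not Django's implementation; the `session` dictionary and the function names are hypothetical stand-ins.

```python
import hmac
import secrets


def issue_csrf_token(session):
    """Create a random token, remember it in the session, and return it.

    The returned value would be embedded in forms served by the real site.
    """
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_request_allowed(session, submitted_token):
    """Accept a request only if it carries the token stored in the session."""
    expected = session.get("csrf_token")
    if not expected or not submitted_token:
        return False
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected, submitted_token)


if __name__ == "__main__":
    session = {}
    token = issue_csrf_token(session)
    print(is_request_allowed(session, token))     # form from the real site
    print(is_request_allowed(session, "forged"))  # request forged elsewhere
```

A page on website A cannot read the token stored for website B, so a forged request arrives without a valid token and is refused.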

Create Node Benchmark in Py2neo

Recently I have been working on a Neo4j project. I use Py2neo to interact with the graph database. Although Py2neo is very Pythonic and easy to use, its performance is really poor. Sometimes I have to write Cypher statements by hand when I can’t bear the slow execution. Here is a small script that I use to compare the performance of 4 different ways to insert nodes. import time from graph_db import graph from py2neo.

Deploy Nikola OrgMode on Travis

Recently I have been enjoying Spacemacs, so I decided to switch from Markdown to org files for writing my blog. After several attempts, I managed to get Travis to convert org files to HTML. Here are the steps. Install orgmode plugin: First, install the orgmode plugin on your computer by following the official guide, Nikola orgmode plugin. Edit conf.el: Org files are converted to HTML for display in Nikola. The orgmode plugin calls Emacs to do this job.

Using Chinese Characters in Matplotlib

After searching on Google, here is the easiest solution I found. It should also work for other languages:

    import matplotlib.pyplot as plt
    %matplotlib inline
    %config InlineBackend.figure_format = 'retina'
    import matplotlib.font_manager as fm

    f = "/System/Library/Fonts/PingFang.ttc"
    prop = fm.FontProperties(fname=f)
    plt.title("你好", fontproperties=prop)


LSTM: To mitigate the vanishing and exploding gradient problems of the vanilla RNN, the LSTM was introduced; it can remember information over longer periods of time. The computation is:

\[\begin{aligned} f_t&=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)\\ i_t&=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)\\ o_t&=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)\\ \tilde{C}_t&=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)\\ C_t&=f_t\ast C_{t-1}+i_t\ast \tilde{C}_t\\ h_t&=o_t \ast \tanh(C_t) \end{aligned}\]

\(f_t\), \(i_t\), and \(o_t\) are the forget gate, input gate, and output gate, respectively. \(\tilde{C}_t\) is the new memory content. \(C_t\) is the cell state. \(h_t\) is the output.
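The six equations translate almost line by line into code. Below is a minimal sketch of a single LSTM time step in plain Python, using lists instead of a tensor library. The `params` dictionary layout and the helper names are assumptions made for illustration, and \(\ast\) is read as element-wise multiplication.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W . v + b for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state h_t and cell state C_t."""
    concat = h_prev + x_t  # the concatenation [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(params["W_f"], concat, params["b_f"])]
    i = [sigmoid(z) for z in affine(params["W_i"], concat, params["b_i"])]
    o = [sigmoid(z) for z in affine(params["W_o"], concat, params["b_o"])]
    c_tilde = [math.tanh(z) for z in affine(params["W_C"], concat, params["b_C"])]
    # C_t = f_t * C_{t-1} + i_t * C~_t (element-wise)
    c_t = [f_k * c_k + i_k * ct_k for f_k, c_k, i_k, ct_k in zip(f, c_prev, i, c_tilde)]
    # h_t = o_t * tanh(C_t) (element-wise)
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t


if __name__ == "__main__":
    # Hidden size 1, input size 1, all weights and biases zero.
    zeros = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_o", "W_C")}
    zeros.update({name: [0.0] for name in ("b_f", "b_i", "b_o", "b_C")})
    print(lstm_step([1.0], [0.0], [2.0], zeros))
```

With every weight set to zero, all three gates evaluate to 0.5 and the candidate memory to 0, so the cell state is simply halved at each step; this gives an easy sanity check of the gate arithmetic.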

Models and Architectures in Word2vec

Models: CBOW (Continuous Bag of Words) uses the context to predict the probability of the current word. The context words’ vectors are \(\upsilon_{c-m}, \dots, \upsilon_{c-1}, \upsilon_{c+1}, \dots, \upsilon_{c+m}\) (\(m\) is the window size). The context vector is their average, \(\hat{\upsilon}=\frac{\upsilon_{c-m}+\dots+\upsilon_{c-1}+\upsilon_{c+1}+\dots+\upsilon_{c+m}}{2m}\). The score vector is \(z_i = u_i\hat{\upsilon}\), where \(u_i\) is the output vector representation of word \(\omega_i\). The scores are turned into probabilities with \(\hat{y}=\mathrm{softmax}(z)\). We want the predicted probabilities \(\hat{y}\) to match the true probabilities \(y\), and we use the cross entropy \(H(\hat{y},y)\) to measure the distance between these two distributions.
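The CBOW forward pass described above can be sketched in plain Python. The function name `cbow_forward` and its argument layout are hypothetical, and the sketch assumes the true distribution \(y\) is one-hot, in which case the cross entropy reduces to the negative log-probability of the target word.

```python
import math


def cbow_forward(context_vectors, output_vectors, target_index):
    """Return (y_hat, loss) for one CBOW training example.

    context_vectors: input vectors of the 2m context words
    output_vectors:  one output vector u_i per vocabulary word
    target_index:    vocabulary index of the true center word
    """
    dim = len(context_vectors[0])
    # Context vector: average of the context word vectors.
    v_hat = [
        sum(vec[d] for vec in context_vectors) / len(context_vectors)
        for d in range(dim)
    ]
    # Scores: z_i = u_i . v_hat for every word in the vocabulary.
    scores = [sum(u_d * v_d for u_d, v_d in zip(u, v_hat)) for u in output_vectors]
    # Softmax, shifted by the maximum score for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    y_hat = [e / total for e in exps]
    # Cross entropy against a one-hot target.
    loss = -math.log(y_hat[target_index])
    return y_hat, loss


if __name__ == "__main__":
    context = [[1.0, 0.0], [0.0, 1.0]]                 # 2m = 2 context words
    outputs = [[0.5, 0.5], [-0.5, 0.5], [0.0, -1.0]]   # vocabulary of 3 words
    y_hat, loss = cbow_forward(context, outputs, target_index=0)
    print(y_hat, loss)
```

Training would then adjust the input and output vectors to reduce this loss; the sketch covers only the forward computation.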

Semi-supervised text classification using doc2vec and label spreading

Here is a simple way to classify text without much human effort and still get impressive performance. It has two steps: (1) obtain training data through keyword classification; (2) build a more accurate classification model using doc2vec and label spreading. Keyword-based Classification: Keyword-based classification is a simple but effective method. Extracting the target keywords by hand is monotonous work. I use this method to automatically extract keyword candidates.
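As an illustration of the first step, keyword-based labeling can be sketched as follows. The keyword lists and category names are made up for the example and are not taken from the post. Documents that match no keyword, or that tie between categories, are marked -1, which is the value scikit-learn's label spreading treats as "unlabeled".

```python
import re

# Hypothetical keyword lists; real ones would come from the extraction step.
KEYWORDS = {
    "sports": {"football", "match"},
    "finance": {"stock", "bank"},
}

UNLABELED = -1


def keyword_label(text, keywords=KEYWORDS):
    """Return the category whose keywords occur most often, or UNLABELED."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = {label: len(tokens & words) for label, words in keywords.items()}
    best = max(hits.values())
    winners = [label for label, count in hits.items() if count == best]
    # No match at all, or a tie between categories: leave the text unlabeled.
    if best == 0 or len(winners) > 1:
        return UNLABELED
    return winners[0]


if __name__ == "__main__":
    for doc in ("The football match ended late", "Nothing relevant here"):
        print(doc, "->", keyword_label(doc))
```

The labeled documents become the seed training data, and the unlabeled ones are left for the second step to resolve.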