Posts List

Torchtext snippets

Load separate files

The data.Field parameters are listed here.

INPUT = data.Field(lower=True, batch_first=True)
TAG = data.Field(batch_first=True, unk_token=None, is_target=True)
train, val, test = data.TabularDataset.splits(
    path=base_dir.as_posix(), train='train_data.csv',
    validation='val_data.csv', test='test_data.csv', format='tsv',
    fields=[(None, None), ('input', INPUT), ('tag', TAG)])

Load single file

all_data = data.TabularDataset(
    path=base_dir / 'gossip_train_data.csv', format='tsv',
    fields=[('text', TEXT), ('category', CATEGORY)])
train, val, test = all_data.split([0.7, 0.2, 0.1])

Create iterator

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_sizes=(32, 256, 256), shuffle=True,
    sort_key=lambda x: x.input)

Load pretrained vector

vectors = Vectors(name='cc.

Circular Import in Python

Recently, I found a really good code example of a Python circular import, and I'd like to record it here. Here is the code:

# X.py
def X1():
    return "x1"

from Y import Y2

def X2():
    return "x2"

# Y.py
def Y1():
    return "y1"

from X import X1

def Y2():
    return "y2"

Guess what will happen if you run python X.
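One way to see how a pair of mutually importing modules like these behaves is to run it in a scratch directory. The sketch below is only an illustration: the helper name `run_circular_import_demo` and the temporary-directory setup are my own assumptions, not part of the original post. It writes the two modules to disk, runs `X.py` as a script, and returns the exit code and error output. On CPython 3 I would expect an `ImportError`: the script runs under the name `__main__`, so `from X import X1` inside `Y` loads a second copy of `X`, which then asks for `Y2` before `Y` has defined it.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# The two modules from the example, reproduced as strings.
X_SOURCE = '''\
def X1():
    return "x1"

from Y import Y2

def X2():
    return "x2"
'''

Y_SOURCE = '''\
def Y1():
    return "y1"

from X import X1

def Y2():
    return "y2"
'''


def run_circular_import_demo():
    """Write X.py and Y.py to a temp dir, run X.py, return (exit code, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "X.py").write_text(X_SOURCE)
        Path(tmp, "Y.py").write_text(Y_SOURCE)
        # Run with the same interpreter; the script's directory is put on
        # sys.path automatically, so X and Y can find each other.
        result = subprocess.run(
            [sys.executable, "X.py"],
            cwd=tmp,
            capture_output=True,
            text=True,
        )
    return result.returncode, result.stderr


if __name__ == "__main__":
    code, err = run_circular_import_demo()
    print("exit code:", code)
    # Show only the last line of the traceback, if there is one.
    print(err.strip().splitlines()[-1] if err.strip() else "(no error output)")
```

Moving the `from ... import ...` lines inside the functions that need them, or importing the module itself (`import Y`) and resolving names at call time, are the usual ways out of this kind of cycle.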

Python Dictionary Implementation

Overview

CPython allocates a block of memory to store a dictionary. The initial table size is 8, and entries are saved as <hash, key, value> in each slot (the slot contents changed after Python 3.6). When a new key is added, Python uses i = hash(key) & mask, where mask = table_size - 1, to calculate which slot it should be placed in. If the slot is occupied, CPython uses a probing algorithm to find an empty slot to store the new item.
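The slot calculation and probing described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not CPython's actual C code: the function names `find_slot` and `insert` are hypothetical, the table is a plain list of `(hash, key, value)` tuples, and resizing is left out entirely. The probe step follows the recurrence used in CPython's `dictobject.c`, `i = (5 * i + perturb + 1) & mask`, with `perturb` shifted right by 5 bits on every step.

```python
PERTURB_SHIFT = 5  # same constant as in CPython's dictobject.c


def find_slot(table, key):
    """Return the index where `key` lives or should be stored.

    `table` is a list whose length is a power of two; each slot is
    either None or a (hash, key, value) tuple. The table must contain
    at least one free slot, because this sketch never resizes.
    """
    mask = len(table) - 1
    h = hash(key)
    # Treat the hash as unsigned so that perturb eventually reaches 0.
    perturb = h & 0xFFFFFFFFFFFFFFFF
    i = h & mask
    while table[i] is not None and table[i][1] != key:
        perturb >>= PERTURB_SHIFT
        i = (i * 5 + perturb + 1) & mask
    return i


def insert(table, key, value):
    """Store (hash, key, value) in the slot chosen by find_slot."""
    i = find_slot(table, key)
    table[i] = (hash(key), key, value)
    return i


if __name__ == "__main__":
    table = [None] * 8           # initial table size of 8, so mask = 7
    print(insert(table, 1, "a"))  # 1 & 7 == 1
    print(insert(table, 9, "b"))  # 9 & 7 == 1 is taken, so probing moves on
```

Small integers hash to themselves, so keys 1 and 9 both start at slot 1 in a table of size 8, which makes the collision easy to follow by hand.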

CSRF in Django

CSRF (cross-site request forgery) is an attack that tricks a user's browser into sending a forged request to a target website. For example, a malicious website A has a button that, when clicked, sends a request to www.B.com/logout. When the user clicks this button, they are logged out of website B without realizing it. A forced logout is not a big problem, but a malicious website can generate more dangerous requests, such as a money transfer.

Django CSRF protection

Each web framework takes a different approach to CSRF protection.
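A common defence against this kind of forged request is a secret token that the server stores per session and expects back with every state-changing request; a page on another site cannot read it, so it cannot include it. The sketch below illustrates that general idea with the standard library only. It is not Django's middleware or API: the `session` dictionary and the function names `issue_csrf_token` and `is_request_allowed` are hypothetical.

```python
import hmac
import secrets

# Methods that should not change server state and are let through unchecked.
SAFE_METHODS = ("GET", "HEAD", "OPTIONS", "TRACE")


def issue_csrf_token(session):
    """Create a random token, remember it in the session, and return it.

    The caller would embed the returned token in its own HTML forms.
    """
    token = secrets.token_urlsafe(32)
    session["csrf_token"] = token
    return token


def is_request_allowed(session, method, submitted_token):
    """Accept unsafe requests only if they carry the session's token."""
    if method in SAFE_METHODS:
        return True
    expected = session.get("csrf_token")
    if not expected or not submitted_token:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(expected.encode(), submitted_token.encode())


if __name__ == "__main__":
    session = {}
    token = issue_csrf_token(session)
    print(is_request_allowed(session, "POST", token))     # form from our own site
    print(is_request_allowed(session, "POST", "forged"))  # request from site A
```

Note that this check only helps if actions such as logout or money transfer are not reachable through safe methods like GET.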

Create Node Benchmark in Py2neo

Recently, I have been working on a Neo4j project, using Py2neo to interact with the graph database. Although Py2neo is very Pythonic and easy to use, its performance is really poor. Sometimes I have to write Cypher statements by hand when I can't bear the slow execution. Here is a small script that I use to compare the performance of 4 different ways to insert nodes.

import time
from graph_db import graph
from py2neo.

Deploy Nikola Org Mode on Travis

Recently I have been enjoying Spacemacs, so I decided to switch from Markdown to org files for writing my blog. After several attempts, I managed to get Travis to convert org files to HTML. Here are the steps.

Install Org Mode plugin

First, install the Org Mode plugin on your computer by following the official guide: Nikola orgmode plugin.

Edit conf.el

Org files are converted to HTML for display on Nikola. The Org Mode plugin calls Emacs to do this job.

Using Chinese Characters in Matplotlib

After searching on Google, here is the easiest solution I found. It should also work for other languages:

import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.font_manager as fm

f = "/System/Library/Fonts/PingFang.ttc"
prop = fm.FontProperties(fname=f)
plt.title("你好", fontproperties=prop)
plt.show()

Output:

Enable C Extension for gensim on Windows

These days I am working on some text classification tasks, using gensim's doc2vec function. When using gensim, it shows this warning message:

```
C extension not loaded for Word2Vec, training will be slow.
```

I searched for this online and found that gensim has rewritten some parts of its code in `cython` rather than `numpy` to get better performance. A compiler is required to enable this feature. I tried installing mingw and adding it to the path, but it did not work.