I keep hearing that word2vec works very well for measuring the similarity between words, so I recently started running Google's open-source code (https://code.google.com/p/word2vec/).

1、 Corpus

Prepare the data first: I used the Sogou full-web news corpus (SogouCA), which several blogs recommend. Its size is 2.1 GB.

Download the data package SogouCA.tar.gz over FTP:
 wget ftp://ftp.labs.sogou.com/Data/SogouCA/SogouCA.tar.gz --ftp-user=hebin_hit@foxmail.com --ftp-password=4FqLSYdNcrDXvNDi -r

Unpack the archive:

 gzip -d SogouCA.tar.gz
tar -xvf SogouCA.tar

Then merge the extracted txt files into SogouCA.txt, keep only the content lines, and convert the encoding to obtain the corpus file corpus.txt, which is 2.7 GB.

 cat *.txt > SogouCA.txt
cat SogouCA.txt | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt
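
The filtering step above can also be sketched in Python, which may help when checking what the pipeline keeps. This is only an illustration: the helper name extract_content_lines is hypothetical, and errors="ignore" is assumed to play roughly the role of iconv's -c flag (silently dropping bytes that cannot be decoded).

```python
def extract_content_lines(raw_lines):
    """Decode GBK byte lines and keep only those containing <content>.

    Hypothetical helper that mirrors:
        iconv -f gbk -t utf-8 -c | grep "<content>"
    """
    kept = []
    for raw in raw_lines:
        # errors="ignore" drops undecodable bytes (assumed analogue of iconv -c)
        line = raw.decode("gbk", errors="ignore")
        if "<content>" in line:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Two made-up records, encoded as GBK like the raw corpus
    sample = [
        "<url>http://example.com/1</url>\n".encode("gbk"),
        "<content>今天天气很好</content>\n".encode("gbk"),
    ]
    for line in extract_content_lines(sample):
        print(line, end="")
```

Note that, like the grep command, this keeps the <content> tags themselves in each line.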

2、 Word segmentation

Use ANSJ to segment corpus.txt into words, producing the segmentation result resultbig.txt, which is 3.1 GB.

In the seg_tool directory of the segmentation tool, compile first and then run it to produce resultbig.txt, which contains 426221 distinct words and 572308385 word occurrences in total.
Segmentation result:
3、 Training word vectors with the word2vec tool
 nohup ./word2vec -train resultbig.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1 &

vectors.bin is the word-vector file that word2vec produces from resultbig.txt. Training took about one and a half hours on the lab server.

4、 Analysis
4.1 Computing similar words:
 ./distance vectors.bin

./distance can be thought of as measuring the distance between words: each word is treated as a point in vector space, and distance compares how close those points are to one another.
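
To make "closeness in vector space" concrete, here is a minimal Python sketch that ranks words by cosine similarity, which, as far as I can tell, is the measure the distance tool reports. The function names and the three-dimensional toy vectors are invented for illustration; real vectors would come from vectors.bin.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(query, vectors, top_n=3):
    """Rank every other word by cosine similarity to `query`, highest first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Toy vectors, invented for illustration only
    toy_vectors = {
        "king": [0.9, 0.1, 0.0],
        "queen": [0.8, 0.2, 0.1],
        "apple": [0.0, 0.1, 0.9],
    }
    for word, score in nearest_words("king", toy_vectors):
        print(f"{word}\t{score:.4f}")
```

A similarity close to 1 means the two vectors point in nearly the same direction; a value near 0 means they are unrelated.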

Here are some examples:

4.2 Latent linguistic regularities

After modifying demo-analogy.sh, we get examples such as the following:
The capital of France is Paris, and the capital of Britain is London: vector("Paris") - vector("France") + vector("Britain") --> vector("London")
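
The arithmetic behind such analogy queries can be sketched in Python. For "France is to Paris as Britain is to ?", the target point is vector("Paris") - vector("France") + vector("Britain"), and the answer is the remaining word closest to it. The four-dimensional toy vectors and the name solve_analogy are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # The three query words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


if __name__ == "__main__":
    # Toy dimensions: [is-country, is-capital, French, British]
    toy_vectors = {
        "France": [1.0, 0.0, 1.0, 0.0],
        "Paris": [0.0, 1.0, 1.0, 0.0],
        "Britain": [1.0, 0.0, 0.0, 1.0],
        "London": [0.0, 1.0, 0.0, 1.0],
        "Berlin": [0.0, 1.0, 0.0, 0.0],
    }
    print(solve_analogy("France", "Paris", "Britain", toy_vectors))
```

With these toy vectors the target lands exactly on the London vector; with real embeddings the match is only approximate, so the nearest neighbor is taken instead.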

4.3 Clustering

Cluster the words of the segmented corpus resultbig.txt, then sort them by class:

 nohup ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500 &
sort classes.txt -k 2 -n > classes_sorted_sogouca.txt
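
To inspect the clusters, the sorted file can be grouped by class id. The Python sketch below assumes each line has the form "word class_id" separated by whitespace, which is consistent with sorting numerically on the second field; the function name group_by_class and the sample lines are hypothetical.

```python
from collections import defaultdict


def group_by_class(lines):
    """Map each class id to the list of words assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, class_id = parts
        clusters[int(class_id)].append(word)
    return dict(clusters)


if __name__ == "__main__":
    # Made-up lines in the assumed "word class_id" format
    sample = ["北京 0", "上海 0", "苹果 1", "香蕉 1", "足球 2"]
    for class_id, words in sorted(group_by_class(sample).items()):
        print(class_id, " ".join(words))
```

For the real output, the lines would be read from classes_sorted_sogouca.txt with open(..., encoding="utf-8").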

For example:

4.4 Phrase analysis

First, use the segmented corpus resultbig.txt to produce sogouca_phrase.txt, a file containing both words and phrases. Then train vector representations for the words and phrases in that file.

 ./word2phrase -train resultbig.txt -output sogouca_phrase.txt -threshold -debug
./word2vec -train sogouca_phrase.txt -output vectors_sogouca_phrase.bin -cbow -size -window -negative -hs -sample 1e- -threads -binary
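
How adjacent words get merged into phrases can be sketched roughly as follows. The score follows the formula in Mikolov et al. [2], (count(ab) - δ) / (count(a) × count(b)); scaling by the total number of tokens is my assumption about how the tool makes the threshold independent of corpus size. The function names, the min_count default, and the English toy sentence are all hypothetical.

```python
from collections import Counter


def phrase_scores(tokens, min_count=1):
    """Score each adjacent pair: (count(ab) - min_count) / (count(a) * count(b)) * N."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    return {
        pair: (n_ab - min_count) / (unigrams[pair[0]] * unigrams[pair[1]]) * total
        for pair, n_ab in bigrams.items()
    }


def merge_phrases(tokens, threshold, min_count=1):
    """Greedily join adjacent pairs scoring above `threshold` with an underscore."""
    scores = phrase_scores(tokens, min_count)
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores[pair] > threshold:
            merged.append("_".join(pair))
            i += 2  # both words were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    tokens = "new york is big and new york is old".split()
    print(" ".join(merge_phrases(tokens, threshold=2)))
```

A higher threshold yields fewer, more strongly associated phrases; pairs that co-occur only by chance score near zero and stay separate.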

Here are a few examples of computing similarity:

5、 Reference links

1. word2vec: Tool for computing continuous distributed representations of words, https://code.google.com/p/word2vec/

2. Playing with Google's open-source deep-learning project word2vec on Chinese text, http://www.cnblogs.com/wowarsenal/p/3293586.html

3. Clustering keywords with word2vec, http://blog.csdn.net/zhaoxinfan/article/details/11069485

6、 Papers to read closely next:

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

[4] R. Collobert, J. Weston, L. Bottou, et al. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.

