I kept hearing that word2vec works very well for measuring similarity between words, so I recently started running Google's open-source code (https://code.google.com/p/word2vec/).

1、 Corpus

Prepare the data first: I used the full-web news dataset (SogouCA) recommended by blogs online. Its size is 2.1G.

Download the data package SogouCA.tar.gz over FTP:
 wget ftp://ftp.labs.sogou.com/Data/SogouCA/SogouCA.tar.gz --ftp-user=hebin_hit@foxmail.com --ftp-password=4FqLSYdNcrDXvNDi -r

Decompress the package:

 gzip -d SogouCA.tar.gz
tar -xvf SogouCA.tar

Then merge the extracted txt files into SogouCA.txt, pull out the lines holding the content field, and convert the encoding from GBK to get the corpus corpus.txt. Its size is 2.7G.

 cat *.txt > SogouCA.txt
cat SogouCA.txt | iconv -f gbk -t utf-8 -c | grep "<content>" > corpus.txt
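The grep step keeps each matching line whole, including the <content> tags themselves. As a minimal sketch, assuming every article body sits on one line of the form <content>...</content> (the helper name extract_content and the sample lines are made up for illustration), the tags could be stripped before segmentation like this:

```python
import re

# Assumption: each article body is on a single line wrapped in
# <content>...</content>. extract_content is a hypothetical helper.
CONTENT_RE = re.compile(r"<content>(.*?)</content>")

def extract_content(lines):
    """Yield the non-empty text found between <content> tags."""
    for line in lines:
        match = CONTENT_RE.search(line)
        if match:
            text = match.group(1).strip()
            if text:
                yield text

# Made-up sample lines, not taken from SogouCA.
sample = [
    "<url>http://example.com/a</url>",
    "<content>今天天气很好</content>",
    "<content></content>",
]
print(list(extract_content(sample)))
```

Dropping empty bodies at this stage keeps blank lines out of the segmenter's input.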

2、 Word segmentation

Use ANSJ to segment corpus.txt, which gives the segmentation result resultbig.txt. Its size is 3.1G.

In the segmentation tool directory seg_tool, compile first and then run the tool to get the segmentation result resultbig.txt, which contains 426221 distinct words and 572308385 word occurrences in total.
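Figures like these can be sanity-checked with a short script. This is only a sketch, assuming the segmenter separates words with whitespace; corpus_stats is a hypothetical helper, and its numbers may differ from what word2vec reports if the tool drops rare words.

```python
from collections import Counter

def corpus_stats(lines):
    """Return (distinct word count, total token count) for whitespace-segmented lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return len(counts), sum(counts.values())

# Made-up segmented sentences for illustration.
sample = ["我 爱 北京", "我 爱 上海"]
vocab_size, total_tokens = corpus_stats(sample)
print(vocab_size, total_tokens)  # 4 6
```

For the real file, passing an open file object (for example open("resultbig.txt", encoding="utf-8")) streams it line by line instead of loading 3.1G into memory.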
Segmentation result:

3、 Train word vectors with the word2vec tool
 nohup ./word2vec -train resultbig.txt -output vectors.bin -cbow -size -window -negative -hs -sample 1e- -threads -binary &

vectors.bin is the word vector file that word2vec generates from resultbig.txt. Training took one and a half hours on the lab server.
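To inspect the vectors outside the bundled tools, the file can be parsed directly. The sketch below assumes the file was written in the tool's binary mode with this layout: a text header "vocab_size dim", then for each word its UTF-8 bytes, a space, dim little-endian 32-bit floats, and a newline. The function name read_word2vec_binary is made up, and the byte order is an assumption that holds on x86 machines.

```python
import io
import struct

def read_word2vec_binary(stream):
    """Read {word: vector} from a binary stream in the assumed word2vec layout."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise EOFError("unexpected end of file while reading a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline that closes the previous entry
                chars.extend(ch)
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("unexpected end of file while reading a vector")
        vectors[chars.decode("utf-8")] = struct.unpack(f"<{dim}f", raw)
    return vectors

# Build a tiny in-memory file with made-up vectors to exercise the reader.
buf = io.BytesIO()
buf.write(b"2 3\n")
for word, vec in [("北京", (1.0, 0.0, 0.5)), ("上海", (0.0, 1.0, 0.5))]:
    buf.write(word.encode("utf-8") + b" " + struct.pack("<3f", *vec) + b"\n")
buf.seek(0)

vectors = read_word2vec_binary(buf)
print(vectors["北京"])
```

For the real file, open it with open("vectors.bin", "rb") and pass the file object in place of the in-memory buffer.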

4、 Analysis
4.1 Computing similar words:
 ./distance vectors.bin

./distance can be seen as computing the distance between words: each word is treated as a point in vector space, and distance measures how close those points are to each other.
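The distance tool ranks neighbours by cosine similarity, so a higher score means a closer word. A minimal sketch of that ranking follows; the three-dimensional toy vectors and the helper names cosine and most_similar are invented for illustration and are not part of the word2vec release.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Return the topn (word, score) pairs closest to the query word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors; real ones would come from vectors.bin.
toy = {
    "king": (0.9, 0.1, 0.0),
    "queen": (0.8, 0.3, 0.0),
    "apple": (0.0, 0.1, 0.9),
}
for neighbour, score in most_similar("king", toy, topn=2):
    print(neighbour, round(score, 3))
```

Because cosine similarity ignores vector length, two words count as close when their vectors point in the same direction, regardless of magnitude.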

Here are some examples:

4.2 Underlying linguistic regularities

After modifying demo-analogy.sh, we get examples such as the following:
The capital of France is Paris, and the capital of the UK is London: vector("Paris") - vector("France") + vector("UK") --> vector("London")
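The analogy tool answers a query of three words A, B, C by looking for the word whose vector is closest to vector(B) - vector(A) + vector(C). The sketch below illustrates that arithmetic with hand-made toy vectors chosen so the capital-of relation is an exact offset; the helper name analogy and all values are assumptions for illustration, and real embeddings only satisfy such relations approximately.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(b) - vector(a) + vector(c), excluding the inputs."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Toy vectors: the second coordinate plays the role of "is a capital".
toy = {
    "France": (1.0, 0.0, 0.0),
    "Paris": (1.0, 1.0, 0.0),
    "UK": (0.0, 0.0, 1.0),
    "London": (0.0, 1.0, 1.0),
    "apple": (0.5, -1.0, 0.2),
}
print(analogy("France", "Paris", "UK", toy))  # London
```

Excluding the three query words from the candidates matters, because the target vector often lies closest to one of them.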

4.3 Clustering

Cluster the words of the segmented corpus resultbig.txt, then sort them by class:

nohup ./word2vec -train resultbig.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500 &
sort classes.txt -k 2 -n > classes_sorted_sogouca.txt
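The sort on the second column works because each line of classes.txt pairs a word with its class number. As a sketch under that assumption (group_by_class is a hypothetical helper and the sample lines are made up), the same file can be grouped into one word list per class:

```python
from collections import defaultdict

def group_by_class(lines):
    """Map each class id to the list of words assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, class_id = parts
        clusters[int(class_id)].append(word)
    return dict(clusters)

# Made-up lines in the assumed "word class_id" format.
sample = ["北京 3", "苹果 7", "上海 3", "香蕉 7"]
for class_id, words in sorted(group_by_class(sample).items()):
    print(class_id, " ".join(words))
```

Printing one class per line makes it easier to judge by eye whether a cluster is coherent.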

For example:

4.4 Phrase analysis

First, use the segmented corpus resultbig.txt to produce sogouca_phrase.txt, a file containing both words and phrases. Then train vector representations for the words and phrases in that file.

 ./word2phrase -train resultbig.txt -output sogouca_phrase.txt -threshold -debug
./word2vec -train sogouca_phrase.txt -output vectors_sogouca_phrase.bin -cbow -size -window -negative -hs -sample 1e- -threads -binary
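The phrase step joins adjacent words that co-occur more often than their individual frequencies would suggest. The sketch below follows the scoring idea from reference [2], score = (count(a b) - delta) / (count(a) * count(b)), scaled here by the corpus size so that thresholds are easier to compare. It is a simplified illustration, not the tool's exact code: join_phrases, min_count, and the sample sentence are all assumptions.

```python
from collections import Counter

def join_phrases(tokens, threshold, min_count=1):
    """Greedily join adjacent tokens whose bigram score exceeds threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # min_count acts as the discount delta from the paper.
            score = (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * total
            if score > threshold:
                out.append(a + "_" + b)
                i += 2  # both tokens are consumed by the phrase
                continue
        out.append(tokens[i])
        i += 1
    return out

# Made-up English sample; the real input would be the segmented corpus.
tokens = "new york is big and new york is old and paris is new".split()
print(" ".join(join_phrases(tokens, threshold=2)))
```

A higher threshold yields fewer and more reliable phrases, while the discount keeps pairs seen only once from being joined.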

Here are a few examples of computing similarity:

5、 Reference links

1. word2vec: Tool for computing continuous distributed representations of words, https://code.google.com/p/word2vec/

2. Playing with Google's open-source deep learning project word2vec in Chinese, http://www.cnblogs.com/wowarsenal/p/3293586.html

3. Clustering keywords with word2vec, http://blog.csdn.net/zhaoxinfan/article/details/11069485

6、 Papers to read carefully as a follow-up:

[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
[2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
[3] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

[4] Ronan Collobert, Jason Weston, Léon Bottou, et al. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.
