# Preface

Very early on the planning of word segmentation algorithm , It's divided into five parts , I want to talk about the segmentation methods I used in various scenes and make a summary , Things have dragged on until now , I'll make up for the last one today . We've explained in the previous posts that no matter what participle 、 Part of speech tagging or NER, Can be abstracted into a sequence annotation model ,seq2seq, It's mapping one sequence to another , This is in NLP Domains are very common , because NLP The word order 、 Context is very important , So what is the current word , We have to go back and see what we said , Even after what , This is also in line with the human habit of reading comprehension . As a result of abstraction Seq2Seq Model of , Then we can apply the relevant model to solve , such as HMM、CRF And in depth RNN, Let's talk about that in this article LSTM The application of participles , And some in use trick, Like how to add a dictionary .

# Cyclic neural network

In previous posts Mario AI Implementation mode exploration —— neural network + To enhance learning , I explained the history of neural networks , And the beginning of the recent wave of AI CNN, The concept of convolutional neural network . Convolution neural network has brought a qualitative leap in the field of image , It will also be established by Professor Li Feifei ImageNet The game was elevated to new heights , Image recognition field , Computers surpassed humans for the first time , And that's what's happened to AI in the last two or three years 、 A continuing focus on deep learning .

When CNN After the boom in graphics , Nature is one of the three fields of artificial intelligence NLP, It was used very quickly , The famous Text-CNN, You can read this paper if you are interested Convolutional Neural Networks for Sentence Classification, Yes NLP The field also has important milestones , Now the reference has been reached 3436.

however CNN There is a serious problem , There is no concept of sequence in it , If we make a good sentence embedding Throw it CNN In the classification model , that CNN Think of the sentence as a bag of words （bag-of-words bag）, In this way NLP The important word order information in the domain is lost , So here we have RNN, That is, recurrent neural network or recurrent neural network （ Here's what's interesting , If it's a classification model for statements , Then use CNN For different kernel Convolution of , Then, some word order information can be extracted , This also involves various varieties of CNN, You can look it up ）.

For circular neural networks , In fact, with CRF、HMM There's a lot in common , For each input $$x_t$$, We all get a state through network transformation $$h_t$$, For a sequence , every last token（ It can be a word or a word , In participle is a word ） Will go into network iteration , Note that the parameters in the network are shared . You can't avoid putting classic images here ：

Here we expand the circular neural network , Just like that . Notice in the picture below $$A$$, stay RNN This is a relatively simple feedforward neural network , stay RNN There is a serious problem , When the sequence is long ,BP The algorithm is feedback , The gradient goes to zero , It's called gradient disappearance （vanishing gradient） problem , And that leads to this LSTM（Long Short Term Memory）.

LSTM It's essentially a recurrent neural network , It just takes what we mentioned above $$A$$ The change , Three doors were added , It's just a few transformation expressions about vectors , To avoid the gradient vanishing problem , bring LSTM The logical unit of the system can better save the sequence information , You can't avoid the following classic picture ：

There are four corresponding expressions in the figure, as follows ：

Oblivion gate ：

$f_t=\sigma (W_f\cdot [h_{t-1},x_t]+b_f$

Input gate ：

$i_t=\sigma (W_i\cdot [h_{t-1},x_t]+b_i$

$\widetilde{C}=tanh(W_C\cdot [h_{t-1},x_t]+b_C$

Status update ：

$C_t=f_t*C_{t-1}+i_t*\widetilde{C}_t$

Output gate ：

$O_t=\sigma (W_o[h_{t-1},x_t]+b_o)$

$h_t=O_t*tanh(C_t)$

In general? LSTM It's all in one direction to loop the sequence into the network , However, sometimes we need to focus on the information of the sequence on both sides , So that leads to this Bi-LSTM, The two-way LSTM, It's simple , That's for a sequence , We have two LSTM The Internet , A sequence of forward inputs , A reverse input sequence , Then will output state Splice together , For subsequent use .

So far, let's talk about recurrent neural networks , Now let's look at the application in word segmentation LSTM

# be based on LSTM Word segmentation

Previous articles and previous series of blog articles , We are already familiar with the conversion of participle to Seq2Seq The idea of , So for LSTM, What we need to do is map a string of sentences into Embedding, Then output one by one to the network , Get the state output , Sequence tagging . We use TensorFlow To develop .

## Embedding

About Embedding, We can download it directly from the Internet Wiki Data set trained Embedding, The general dimension is 100, You can also do it on your own , utilize Word2Vec、Fasttext Wait to train their own Embedding.

## Data preprocessing

In fact, many models of depth are very mature , The most troublesome is data preprocessing , In the data preprocessing phase, the core is to map the sequence to Embedding File corresponding id Sequence , And in accordance with the Batch To slice , It is usually set according to the size of the data set 64、128、256 Such as different batch size , I'm entering data into the network , Conduct epoch iteration , Pay attention to carry out the necessary shuffle operation , It's useful for improving results ,shuffle It's like this ：

def shuffle(char_data, tag_data, dict_data, len_data):
char_data = np.asarray(char_data)
tag_data = np.asarray(tag_data)
dict_data = np.asarray(dict_data)
len_data = np.asarray(len_data)
idx = np.arange(len(len_data))
np.random.shuffle(idx)
return (char_data[idx], tag_data[idx], dict_data[idx], len_data[idx])


## Model

Our core model structure is also very simple , Will input id Sequence , adopt Tensorflow A table lookup operation , Map to the corresponding Embedding, Then enter it into the network , Get the final result , Conduct Decode operation , Gets a token for each character （BEMS）, The core code is as follows ：

 def __init__(self, config, init_embedding = None):
self.batch_size = batch_size = config.batch_size
self.embedding_size = config.embedding_size # column
self.hidden_size = config.hidden_size
self.vocab_size = config.vocab_size # row
# Define input and target tensors
self._input_data = tf.placeholder(tf.int32, [batch_size, None], name="input_data")
self._targets = tf.placeholder(tf.int32, [batch_size, None], name="targets_data")
self._dicts = tf.placeholder(tf.float32, [batch_size, None], name="dict_data")
self._seq_len = tf.placeholder(tf.int32, [batch_size], name="seq_len_data")
with tf.device("/cpu:0"):
if init_embedding is None:
self.embedding = tf.get_variable("embedding", [self.vocab_size, self.embedding_size], dtype=data_type())
else:
self.embedding = tf.Variable(init_embedding, name="embedding", dtype=data_type())
inputs = tf.nn.embedding_lookup(self.embedding, self._input_data)
inputs = tf.nn.dropout(inputs, config.keep_prob)
inputs = tf.reshape(inputs, [batch_size, -1, 9 * self.embedding_size])
d = tf.reshape(self._dicts, [batch_size, -1, 16])
self._loss, self._logits, self._trans = _bilstm_model(inputs, self._targets, d, self._seq_len, config)
# CRF decode
self._viterbi_sequence, _ = crf_model.crf_decode(self._logits, self._trans, self._seq_len)
with tf.variable_scope("train_ops") as scope:
# Gradients and SGD update operation for training the model.
self._lr = tf.Variable(0.0, trainable=False)
tvars = tf.trainable_variables() # all variables need to train
global_step=tf.contrib.framework.get_or_create_global_step())
self._new_lr = tf.placeholder(data_type(), shape=[], name="new_learning_rate")
self._lr_update = tf.assign(self._lr, self._new_lr)
self.saver = tf.train.Saver(tf.global_variables())


The logic of the code is clear , After you get all the inputs ,embedding After looking up the table , Put in Bi-LSTM Model , The results obtained are carried out Decode, Notice that we used one here CRF In the tail Decode, The result is better after the experiment , Actually, it goes up one level Softmax also ok. about bilstm as follows ：

def _bilstm_model(inputs, targets, dicts, seq_len, config):
'''
@Use BasicLSTMCell, MultiRNNCell method to build LSTM model
@return logits, cost and others
'''
batch_size = config.batch_size
hidden_size = config.hidden_size
vocab_size = config.vocab_size
target_num = config.target_num # target output number
seq_len = tf.cast(seq_len, tf.int32)
fw_cell = lstm_cell(hidden_size)
bw_cell = lstm_cell(hidden_size)
with tf.variable_scope("seg_bilstm"): # like namespace
# we use only one layer
(forward_output, backward_output), _ = tf.nn.bidirectional_dynamic_rnn(
fw_cell,
bw_cell,
inputs,
dtype=tf.float32,
sequence_length=seq_len,
scope='layer_1'
)
# [batch_size, max_time, cell_fw.output_size]/[batch_size, max_time, cell_bw.output_size]
output = tf.concat(axis=2, values=[forward_output, backward_output]) # fw/bw dimension is 3
if config.stack: # False
(forward_output, backward_output), _ = tf.nn.bidirectional_dynamic_rnn(
fw_cell,
bw_cell,
output,
dtype=tf.float32,
sequence_length=seq_len,
scope='layer_2'
)
output = tf.concat(axis=2, values=[forward_output, backward_output])
output = tf.concat(values=[output, dicts], axis=2) # add dicts to the end
# outputs is a length T list of output vectors, which is [batch_size*maxlen, 2 * hidden_size]
output = tf.reshape(output, [-1, 2 * hidden_size + 16])
softmax_w = tf.get_variable("softmax_w", [hidden_size * 2 + 16, target_num], dtype=data_type())
softmax_b = tf.get_variable("softmax_b", [target_num], dtype=data_type())
logits = tf.matmul(output, softmax_w) + softmax_b
logits = tf.reshape(logits, [batch_size, -1, target_num])
with tf.variable_scope("loss") as scope:
# CRF log likelihood
log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
logits, targets, seq_len)
loss = tf.reduce_mean(-log_likelihood)
return loss, logits, transition_params


Notice we did it twice LSTM, And spliced the results together , And our loss function is about crf_log_likelihood.

## How do I add a user dictionary

We can see that after the whole model is trained ,inference This process is directly based on the weight of the network , So how do you add a user dictionary , The way we use here is to splice user dictionaries as additional features in Bi-LSTM After the result , That's the code up there output = tf.concat(values=[output, dicts], axis=2) # add dicts to the end here , The dictionary will be divided into four parts ,head、mid、single、tail, Pref. 、 In the word 、 Endings and single words , This is useful for user dictionaries one-hot In the form of expression , However, in the actual use process, there are still problems that can not be cut out , Readers can consider strengthening these features .

I'm going to put the whole code here github Yes , Interested readers directly look at the source code , Please leave a message if you have any questions ~

https://github.com/xlturing/machine-learning-journey/tree/master/seg_bilstm

I finally wrote the series , And then thank you for doing it recently Attention、Transformer as well as BERT This set of applications in text classification , Welcome to the exchange .

## On word segmentation algorithm （5） Word segmentation based on word （bi-LSTM） More articles about

1. On word segmentation algorithm （4） Word segmentation based on word （CRF）

Catalog Preface Catalog Conditional random field (conditional random field CRF) emphasis Linear chain element random field Simplified form CRF participle CRF VS HMM Code implementation Training code experimental result reference ...

2. On word segmentation algorithm （3） Word segmentation based on word （HMM）

Catalog Preface Catalog hidden Markov model (Hidden Markov Model,HMM) HMM participle Two hypotheses Viterbi Algorithm Code implementation Realization effect Complete code reference Preface On the word segmentation algorithm (1) In the participle ...

3. Word segmentation algorithm based on word segmentation method （HMM）

Preface On the word segmentation algorithm (1) We have discussed two kinds of word segmentation: dictionary based word segmentation and word based word segmentation , On the word segmentation algorithm (2) In this paper, we use n-gram The method of word segmentation based on dictionary is realized . stay (1) in , We also discussed ...

4. On LAN ARP The harm of the attack and the prevention method ( chart )

On LAN ARP The harm of the attack and the prevention method ( chart )   author : Ice Shield firewall Website :www.bingdun.com date :2015-03-03   since last year 5 In May, the local area network on campus began to drop frequently , Normal education and teaching ...

5. Talking about Tarjan Algorithms and ideas

In digraph G in , If there is at least one path between two vertices , We call two vertices strongly connected (strongly connected). If there is a digraph G Every two vertices of are strongly connected , call G It's a strongly connected graph . The maximal strongly connected subgraph of a directed graph of a non strongly connected graph , It's called qianglian ...

Catalog sketch effect Tarjan Algorithm principle The people who came out Icon Code implementation Example Example 1 Example 2 Example 3 Example 4 Example 5 summary sketch For beginners Tarjan For you , I'm sure you and I learned from the beginning Tarjan None of the same ...

7. Talking about Manacher Algorithms and extensions KMP The connection between

First , Talking about Manacher Algorithm before , Let's look at a small problem first : Given a string S, Find the length of the longest palindrome substring of the string . For the solution of this problem . There are many solutions on the Internet . The time complexity is not the same , Here are some common solutions . Solution 1   ...

8. [Machine Learning] Talking about LR Algorithm Cost Function

understand LR All the students know that ,LR The minimum cross entropy or maximum likelihood estimation function is used as Cost Function, That's an interesting question , Why don't we use the simpler and more familiar function of minimizing squared error (MSE) Well ? I personally understand ...

So let's start here Preliminary knowledge Two arrays Tarjan Application of algorithm Find cut point and cut edge Find the point - Double connected components Seeking the edge - Double connected components Find the strongly connected component Preliminary knowledge Set up undirected graph $G_{0} = (V_{0}, E_{0})$, among \$V_ ...

## Random recommendation

1. [ correct ] The curve of mobile platform is not smooth （ Such as ：TRectangle, TPath... etc. ）

problem : from  XE4  since ,Firemonkey  The problem that curve drawing is not smooth on the mobile platform has always been criticized , Submitted to the official QC It is also too complicated to be prepared , Officials seem to have deliberately avoided this issue , It's been a long time . Applicable version :XE4 ~ Ber ...

2. IE Next a On the back of the label span The element floats to the right and dislocates

The reason for the error span On the a After tag The correct way to write it is to put it before as follows : <li><span>2016-07-29</span><a href="#" ...

3. hdu1042 N!

/* N! Time Limit: 10000/5000 MS (Java/Others) Memory Limit: 262144/262144 K (Java/Others) Total Subm ...

Because I've used the drop-down menu before , So now take a moment to go back and sort it out , Gradually improve the drop-down menu , And provide some basic functions , So that it can be reused if necessary in the future , And provide reference for those in need . The drop-down menu is also divided into data source and ...

5. [ translate ]36 Days of Web Testing( 3、 ... and )

Day 14: Automate the tedious Why ? Sometimes ,web The test is still quite tedious , Before you start the test , You may have to jump to a specific form page , Or to get a specific page ( Or configuration ), you ...

6. Oracle bone inscriptions promote Java March “ The Internet of things ”

The company hopes to work on embedded device development projects Java Can replace C      With Tuesday's announcement on Embedded Java Version upgrade , Oracle hopes to extend the platform to a new generation of connected devices , Also known as the Internet of things . Oracle also hopes that ,Java In some embedded development projects ...

7. python Memo

list&tuple operation multiply constant >>> x = ((1,2),) >>> x*2 ((1, 2), (1, 2)) >>> ...

8. The mathematical way of thinking -sasMEMO(17)

SAS Date and time formats data  _null_;input mydate YYMMDD10.;put mydate YYMMDDB10.;put mydate YYMMDDC10.;put mydat ...

9. mysql appear Can&#39;t connect to MySQL server on &#39;localhost&#39; (10061） Solutions for

A way to search on the Internet : Today mysql Copy the database to another machine , It didn't work , newspaper “Can't connect to MySQL server on 'localhost' (10061)“ error Go online sear ...

10. iOS Development Overview UIkit Dynamics , about UIKit Of Dynamic characteristic ,UIkit Dynamics is UIkit The framework simulates some of the characteristics of the real world .

forward :http://my.oschina.net/u/1378445/blog/335014 iOS UIKit Dynamics Dynamics UIAttachmentBehavior Realization iMessage ...