### Natural Language Processing (PyTorch Edition)

# Natural Language Processing with PyTorch

# Natural Language Processing: A Basic Introduction

Title: Natural-Language-Processing-with-PyTorch (1)

Author: Yif Du

Published: 2018-12-17 09:12

Last updated: 2019-02-16 21:02

Original link: http://yifdu.github.io/2018/12/17/Natural-Language-Processing-with-PyTorch（一）/

License: CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International). Please keep the original link and credit the author when reprinting.

Household names like **Echo (Alexa), Siri, and Google Translate** have at least one thing in common: they are all products of **natural language processing (NLP) applications**, one of the two main topics of this book.

NLP refers to a set of techniques that apply statistical methods, with or without insights from linguistics, to understand text in order to solve real-world tasks. This "**understanding**" of text is **derived mainly by transforming text into usable computational representations**, which are **discrete or continuous combinatorial structures such as vectors, tensors, graphs, and trees**.

Learning **representations suitable for a task from data (here, text)** is the subject of machine learning. Applying machine learning to text data has a history of more than three decades, but recently (beginning around 2008-2010) [1] a set of machine learning techniques known as deep learning has continued to develop and has proved highly effective for artificial intelligence (AI) tasks in NLP, speech, and computer vision. Deep learning is the other main topic we cover; thus, this book is about NLP and deep learning.

Simply put, **deep learning** enables one to efficiently learn representations from data using an **abstraction called the computational graph, together with numerical optimization techniques**. Such is the success of deep learning and computational graphs that large technology companies such as Google, Facebook, and Amazon have released implementations of computational graph frameworks, and libraries built on them, to capture the mindshare of researchers and engineers.

In this book, we consider **PyTorch**, an increasingly popular Python-based framework for implementing deep learning algorithms. In this chapter, we explain what computational graphs are and why we chose **PyTorch** as our framework.

The fields of machine learning and deep learning are vast. In this chapter, and for most of this book, we mainly consider what is called supervised learning; that is, learning with labeled training examples. We explain the supervised learning paradigm, which will be the foundation for the rest of the book. If many of these terms are unfamiliar to you so far, you are in the right place.

This chapter, along with future chapters, not only clarifies these concepts but also studies them in depth. If you are already familiar with some of the terms and concepts mentioned here, we still encourage you to follow along, for two reasons: to establish a shared vocabulary for the rest of the book, and to fill any gaps needed to understand future chapters.

**The goals of this chapter are to:**

- Develop a clear understanding of the supervised learning paradigm, understand its terminology, and develop a conceptual framework for approaching the learning tasks of future chapters
- Learn how to encode the inputs of learning tasks
- Understand what computational graphs are
- Master the basics of PyTorch

**Let's get started!**

## The Supervised Learning Paradigm

**Supervision** in machine learning, or simply supervised learning, refers to cases where **the ground truth of the targets (what is being predicted) is available for the observations (inputs)**.

For example, in **document classification**, the **target** is a **categorical label** and the **observation (input)** is a document.

In **machine translation**, the observation (input) is a sentence in one language and the target is a sentence in another language. With this understanding of the input data, we illustrate the supervised learning paradigm in Figure 1-1.

We can break the supervised learning paradigm down into six main concepts, as shown in the figure:

**Observations**: **Observations are items about which we want to predict something.** We denote observations by `x` and sometimes refer to them as "**inputs**."

**Targets**: **Targets are the labels corresponding to observations.** They are usually the things being predicted. Following standard notation in machine learning/deep learning, we use `y` to refer to these. Sometimes this is known as the ground truth.

**Model**: A model is a mathematical expression or function that takes an observation `x` and predicts the value of its target label.

**Parameters**: Sometimes called weights, these parameterize the model. Standard notation uses `w` (for weights) or `w_hat`.

**Predictions**: Predictions, also called **estimates**, are the values of the targets guessed by the model, given the observations. We denote these with a "hat" notation; thus, the prediction of a target `y` is denoted `y_hat`.

**Loss function**: **A loss function compares how far a prediction is from the target for observations in the training data.** Given a target and its prediction, the loss function assigns a scalar real value called the loss. **The lower the loss, the better the model's predictions of the target. We denote the loss function by `L`.**

Although it is not strictly necessary mathematically for NLP/deep learning modeling or for writing this book, we will formally restate the supervised learning paradigm, in order to equip readers new to the area with standard terminology and to familiarize them with the notation and style of writing found in research papers on arXiv.

Consider a dataset `D = {X[i], y[i]}, i = 1..n` with `n` examples. Given this dataset, we want to learn a function (model) `f` parameterized by weights `w`. That is, we make an assumption about the structure of `f`, and given that structure, the learned values of the weights `w` will fully characterize the model.

For a given input `X`, the model predicts `y_hat` as the target:

`y_hat = f(X; w)`

In supervised learning, for the training examples, we know the true target `y` of an observation. The loss for this instance is then `L(y, y_hat)`.

**Supervised learning then becomes the process of finding the optimal parameters/weights `w`** that minimize the cumulative loss over all `n` examples.

**Training with (stochastic) gradient descent.** The goal of supervised learning is to pick values of the parameters that minimize the loss function for a given dataset. In other words, this is equivalent to finding the roots of an equation. **Gradient descent** is a common technique for **finding the roots of an equation**. Recall that in traditional gradient descent, we guess initial values for the roots (parameters) and **iteratively update these parameters** until the objective function (loss function) evaluates to a value below an acceptable threshold (the convergence criterion).

For large datasets, implementing traditional gradient descent over the entire dataset is usually impossible due to memory constraints, and very slow due to the computational expense. Instead, an approximation called **stochastic gradient descent** (SGD) is usually employed.

In the stochastic case, a data point or a subset of data points is picked at random, and the gradient is computed for that subset. When a single data point is used, the approach is called pure SGD; when a subset of (multiple) data points is used, we refer to it as **minibatch SGD**.

Often the words "pure" and "minibatch" are dropped when the variant being used is clear from context. In practice, pure SGD is rarely used, because its noisy updates lead to very slow convergence. There are different variants of the general SGD algorithm, all aiming for faster convergence. In later chapters, we will explore some of these variants, along with how the gradients are used to update the parameters. **This iterative process of updating the parameters is called backpropagation.** Each step (also called an epoch) of backpropagation consists of a **forward pass** and a **backward pass**.

**The forward pass evaluates the inputs with the current values of the parameters and computes the loss function.**

**The backward pass updates the parameters using the gradient of the loss.**
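The forward and backward passes described above can be sketched as a minimal PyTorch training loop. This is an illustrative sketch, not code from the book: the toy data, the one-parameter linear model, and the hyperparameters are made up, and the "minibatch" here is simply the whole toy dataset:

```python
import torch

torch.manual_seed(0)

# Toy data: y = 3x + 2 plus a little noise (made up for illustration)
X = torch.linspace(-1, 1, steps=64).unsqueeze(1)
y_true = 3 * X + 2 + 0.1 * torch.randn_like(X)

# Model parameters w and b, initialized to zero
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.1  # learning rate
for epoch in range(100):
    # Forward pass: compute predictions and the loss with the current parameters
    y_hat = X * w + b
    loss = ((y_hat - y_true) ** 2).mean()  # mean squared error

    # Backward pass: compute gradients of the loss w.r.t. w and b
    loss.backward()

    # Gradient descent update, then reset the accumulated gradients
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print(w.item(), b.item())  # should approach 3 and 2
```

After enough epochs, the learned `w` and `b` approach the values used to generate the toy data, which is exactly the convergence criterion described above.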

Note that, up to this point, nothing here is specific to deep learning or neural networks. The directions of the arrows in Figure 1-1 indicate the "flow" of data while training the system.

We will have more to say about **training** and about the concept of "**flow**" in "**computational graphs**," but first, let's look at how to numerically represent the inputs and targets in NLP problems so that we can train models and predict outcomes.

## Observation and Target Encoding

We need to **represent observations (text) numerically** to use them with machine learning algorithms. Figure 1-2 presents a visual depiction.

**A simple way to represent text** is as a numerical vector. There are countless ways to perform this mapping/representation. In fact, much of this book is dedicated to learning such representations for a task from data. However, we begin with some simple count-based representations that are based on heuristics.

Though simple, they are incredibly powerful as they are, and can serve as a starting point for richer representation learning. All of these count-based representations start with a vector of fixed dimension.

### One-Hot Representation

As the name suggests, the one-hot representation starts with a zero vector and sets to 1 the entry in the vector corresponding to a word, if that word appears in the sentence or document. Consider the following two sentences:

```
Time flies like an arrow.
Fruit flies like a banana.
```

Tokenizing the sentences, ignoring punctuation, and treating everything as lowercase yields a vocabulary of size 8: `{time, fruit, flies, like, a, an, arrow, banana}`. So we can represent each word with an eight-dimensional one-hot vector. In this book, we use `1[w]` to denote the one-hot representation of a token/word `w`.

The collapsed one-hot representation for a phrase, sentence, or document is simply the logical OR of the one-hot representations of its constituent words. Using the encoding shown in Figure 1-3, the one-hot representation of the phrase `like a banana` would be a `3×8` matrix, where the columns are eight-dimensional one-hot vectors. It is also common to see a "collapsed" or binary encoding, where the text/phrase is represented by a vector the length of the vocabulary, with 0s and 1s indicating the absence or presence of each word. The binary encoding of `like a banana` would then be: `[0, 0, 0, 1, 1, 0, 0, 1]`.

Note: if at this point you are cringing that we collapsed the two different meanings (or senses) of `flies`, congratulations, astute reader! Language is full of ambiguity, but we can still build useful solutions by making extremely simplifying assumptions. It is possible to learn sense-specific representations, but we are getting ahead of ourselves now.

Although we will rarely use anything other than a one-hot representation for the inputs in this book, we now introduce the Term-Frequency (TF) and Term-Frequency-Inverse-Document-Frequency (TF-IDF) representations, because of their popularity in NLP, for historical reasons, and for the sake of completeness. These representations have a long history in information retrieval (IR) and are actively used even today in production NLP systems.

### TF Representation

The TF representation of a phrase, sentence, or document is simply the sum of the one-hot representations of its constituent words. To continue with our silly examples, using the aforementioned one-hot encoding, the sentence `Fruit flies like time flies a fruit` has the following TF representation: `[1, 2, 2, 1, 1, 0, 0, 0]`. Notice that each entry is a count of the number of times the corresponding word appears in the sentence (corpus). We denote the TF of a word `w` by `TF(w)`.
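A TF vector is just a per-word count over a fixed vocabulary. As a quick sketch in plain Python:

```python
from collections import Counter

# Vocabulary in the order used in the text
vocab = ['time', 'fruit', 'flies', 'like', 'a', 'an', 'arrow', 'banana']

def term_frequency(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

print(term_frequency('Fruit flies like time flies a fruit'))
```

Summing the one-hot vectors of the sentence's words, as the text describes, produces exactly this count vector.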

Example 1-1: Generating a "collapsed" one-hot or binary representation using sklearn

```
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns

corpus = ['Time flies flies like an arrow.',
          'Fruit flies like a banana.']
one_hot_vectorizer = CountVectorizer(binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()
# Column labels come from the fitted vectorizer
# (use get_feature_names() on older versions of sklearn)
vocab = one_hot_vectorizer.get_feature_names_out()
sns.heatmap(one_hot, annot=True,
            cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])
```

The collapsed one-hot is a single vector that may contain multiple 1s.

### TF-IDF Representation

Consider a collection of patent documents. You would expect most of them to contain words like `claim`, `system`, `method`, `procedure`, and so on, often repeated multiple times. The TF representation weights words proportionally to their frequency. However, common words such as `claim` do not add anything to our understanding of a specific patent. Conversely, if a rare word such as `tetrafluoroethylene` occurs less frequently, it quite likely indicates the nature of the patent document, and we would want to give it more weight in our representation. The Inverse-Document-Frequency (IDF) is a heuristic that does exactly that.

**The IDF representation penalizes common tokens and rewards rare tokens in the vector representation.** The `IDF(w)` of a token `w` with respect to a corpus is defined as

`IDF(w) = log(N / n[w])`

where `n[w]` is the number of documents containing the word `w` and `N` is the total number of documents. The TF-IDF score is simply the product `TF(w) * IDF(w)`. First, notice that if a word is very common and appears in all documents (i.e., `n[w] = N`), `IDF(w)` is 0 and the TF-IDF score is 0, thereby completely penalizing that term. Second, if a term appears very rarely, perhaps in only one document, the IDF will be the maximum possible value, `log N`.
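Both boundary cases follow directly from the definition `IDF(w) = log(N / n[w])`. Here is a small sketch with a made-up toy corpus of tokenized documents:

```python
import math

# Toy corpus of N = 4 tokenized "documents" (made up for illustration)
docs = [['claim', 'method', 'system'],
        ['claim', 'method', 'tetrafluoroethylene'],
        ['claim', 'system'],
        ['claim', 'method']]
N = len(docs)

def idf(word):
    """IDF(w) = log(N / n[w]), with n[w] = number of documents containing w."""
    n_w = sum(word in doc for doc in docs)
    return math.log(N / n_w)

print(idf('claim'))                # in all 4 docs: log(4/4) = 0.0
print(idf('tetrafluoroethylene'))  # in 1 doc: log(4/1) = log N, the maximum
```

A word in every document gets IDF 0 (and hence TF-IDF 0), while the rarest word gets the maximum weight `log N`, exactly as described above.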

Example 1-2: Generating a TF-IDF representation using sklearn

```
from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns

# Reuse the corpus from Example 1-1
corpus = ['Time flies flies like an arrow.',
          'Fruit flies like a banana.']
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus).toarray()
vocab = tfidf_vectorizer.get_feature_names_out()
sns.heatmap(tfidf, annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])
```

In **deep learning**, it is rare to see inputs encoded using heuristic representations like TF-IDF, because the goal is to learn a representation.

Often, we start with a one-hot encoding using integer indices and a special "embedding lookup" layer to construct the input to the neural network. In later chapters, we present several examples of doing this.

### Target Encoding

As noted in the "Supervised Learning Paradigm" section, the exact nature of the target variable depends on the NLP task being solved. For example, in machine translation, summarization, and question answering, the target is also text, encoded using approaches such as the one-hot encoding described previously.

Many NLP tasks actually use categorical labels, wherein the model must predict one of a fixed set of labels. A common way to encode this is to use a unique index per label. This simple representation can become problematic when the number of output labels is too large. An example of this is the language modeling problem, in which the task is to predict the next word, given the words seen in the past. The label space is the entire vocabulary of a language, which can easily grow to several hundred thousand, including special characters, names, and so on. We revisit this problem, and how to address it, in later chapters.
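A label-to-index encoding of the kind just described takes only a couple of lines; the label names below are hypothetical:

```python
# Hypothetical label set for a topic-classification task
labels = ['sports', 'politics', 'technology']
label_to_index = {label: i for i, label in enumerate(labels)}
index_to_label = {i: label for label, i in label_to_index.items()}

print(label_to_index['politics'])  # 1
print(index_to_label[2])           # technology
```

The inverse mapping is kept so that a predicted index can be turned back into a human-readable label.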

Some NLP problems involve predicting a numerical value from a given text. For example, given an English essay, we might need to assign a numeric grade or readability score. Given a restaurant review snippet, we might need to predict a numerical star rating up to the first decimal place. Given a user's tweets, we might be required to predict the user's age group. Several approaches exist to encode numerical targets, but simply placing the targets into categorical "bins" (for example, "0-18", "19-25", "25-30", and so on) and treating the problem as an ordinal classification problem is a reasonable approach. The binning can be uniform or nonuniform and data-driven. Although a detailed discussion of this is beyond the scope of this book, we draw your attention to these issues because target encoding can dramatically affect performance in such cases, and we encourage you to see Dougherty et al. (1995) and the references therein.
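The binning idea can be sketched with the standard library's `bisect` module. The bin edges below are an assumption, modeled loosely on the ranges in the text (adjusted so that the bins do not overlap):

```python
import bisect

# Upper edges of the age bins "0-18", "19-25", "26-30", "31+"
bin_edges = [18, 25, 30]
bin_labels = ['0-18', '19-25', '26-30', '31+']

def age_to_bin(age):
    """Map a numeric age to an ordinal category via binary search over the edges."""
    return bin_labels[bisect.bisect_left(bin_edges, age)]

print(age_to_bin(16))  # 0-18
print(age_to_bin(25))  # 19-25
print(age_to_bin(42))  # 31+
```

A data-driven, nonuniform binning would choose `bin_edges` from quantiles of the training targets instead of fixing them by hand.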

## Computational Graphs

Figure 1-1 summarized the supervised learning (training) paradigm as a data flow architecture, in which the model (a mathematical expression) transforms the inputs to obtain predictions, and the loss function (another expression) provides a feedback signal to adjust the parameters of the model. This data flow can be conveniently implemented using the computational graph data structure. Technically, a computational graph is an abstraction that models mathematical expressions. In the context of deep learning, the implementations of computational graphs (such as Theano, TensorFlow, and PyTorch) do additional bookkeeping to implement the automatic differentiation needed to obtain gradients of parameters during training in the supervised learning paradigm. We explore this further in "PyTorch Basics." Inference (or prediction) is simply expression evaluation (a forward flow on the computational graph). Let's see how computational graphs model expressions. Consider the expression: `y = wx + b`

This can be written as two subexpressions, `z = wx` and `y = z + b`. We can then represent the original expression using a **directed acyclic graph** (DAG) in which the **nodes are mathematical operations such as multiplication and addition**. The inputs to an operation are the incoming edges of the node, and the output of the operation is the outgoing edge.

So, for the expression `y = wx + b`, the computational graph is as illustrated in Figure 1-6. In the next section, we will see how PyTorch lets us create computational graphs in an intuitive manner and how it lets us compute gradients without concerning ourselves with any of the bookkeeping.

# Preface

Translator: Yif Du

License: CC BY-NC-ND 4.0

All models are wrong, but some are useful.

This book aims to bring newcomers to natural language processing (NLP) and deep learning up to speed on important topics in these two areas. Both subject areas are growing exponentially. As a book that introduces both deep learning and NLP with an emphasis on implementation, it occupies an important middle ground.

While writing the book, we had to make difficult, and sometimes uncomfortable, choices about what material to leave out. For the beginner reader, we hope the book provides a strong foundation in the basics and a glimpse of what is possible. Machine learning, and deep learning in particular, is an experiential discipline rather than an intellectual science; we hope the generous end-to-end code examples in each chapter invite you to partake in that experience. When we began working on the book, we started with PyTorch 0.2, and the examples were revised with each PyTorch update from 0.2 to 0.4.

PyTorch 1.0 is due to be released around the time this book is published. The code examples in the book are PyTorch 0.4-compliant and should work as they are with the upcoming PyTorch 1.0 release. A note regarding the style of the book:

We have intentionally avoided mathematics in most places, not because deep learning math is particularly difficult (it is not), but because in many situations it is a distraction from the main goal of this book: to empower the beginner learner. In many cases, both in code and in text, we have had similar motivations, and we have erred on the side of verbosity rather than brevity.

Advanced readers and experienced programmers will likely see ways to tighten the code and so on, but our choice was to be as explicit as possible in order to reach the broadest audience that we want to reach.

- Read online
- Read online (Gitee)
- ApacheCN learning resources
- ApacheCN interview & job-hunting group: 724187166
- Code repository

## Contribution Guide

This project needs proofreading; you are welcome to submit a Pull Request.

Please translate and improve the translation boldly. Although we pursue excellence, we do not require you to be perfect, so do not be afraid of making translation mistakes. In most cases, our server records all translations, so you need not worry about causing irreparable damage through your mistakes. (Adapted from Wikipedia)

## Contact

### Maintainers

- Feilong (飞龙): 562826179

### Other

- Open an issue in our GitHub repository, apachecn/nlp-pytorch-zh.
- Send an email to `apachecn@163.com`.
- Contact the group leader or an administrator in our organization's learning exchange group.

## download

### Docker

```
docker pull apachecn0/nlp-pytorch-zh
docker run -tid -p <port>:80 apachecn0/nlp-pytorch-zh
# Visit http://localhost:{port} to view the docs
```

### PYPI

```
pip install nlp-pytorch-zh
nlp-pytorch-zh <port>
# Visit http://localhost:{port} to view the docs
```

### NPM

```
npm install -g nlp-pytorch-zh
nlp-pytorch-zh <port>
# Visit http://localhost:{port} to view the docs
```

- Natural Language Processing with PyTorch
- 1. Basic Introduction
- 2. A Quick Tour of Traditional NLP
- 3. Foundational Components of Neural Networks
- 4. Feed-Forward Networks for NLP
- 5. Embedding Words and Types
- 6. Sequence Modeling for NLP
- 7. Intermediate Sequence Modeling for NLP
- 8. Advanced Sequence Modeling for NLP
- 9. Classics, Frontiers, and Next Steps

**Quoted from this book for learning purposes and noncommercial use only. We recommend that you read the original book and learn together!**

**Let's go!**

**Thank you!**

**Keep striving!**