[Opening Chapter] Natural Language Processing (PyTorch Edition)

ZSYL 2021-10-14 06:36:39

Natural Language Processing with PyTorch


Natural Language Processing: A Basic Introduction

Title: Natural-Language-Processing-with-PyTorch (Part One)

Author: Yif Du

Published: 2018-12-17 09:12

Last updated: 2019-02-16 21:02

Original link: http://yifdu.github.io/2018/12/17/Natural-Language-Processing-with-PyTorch( One )/

License: CC BY-NC-ND 4.0 International (Attribution - NonCommercial - NoDerivatives). Please keep the original link and credit the author when reprinting.

Household names like Echo (Alexa), Siri, and Google Translate have at least one thing in common: they are all products derived from the application of natural language processing (NLP), one of the two main topics of this book.

NLP refers to a set of techniques that apply statistical methods, with or without insights from linguistics, to understand text for the sake of solving real-world tasks. This "understanding" of text is mainly derived by transforming text into usable computational representations, which are discrete or continuous combinatorial structures such as vectors, tensors, graphs, and trees.

Learning representations suitable for a task from data (text, in this case) is the subject of machine learning. The application of machine learning to textual data has more than a three-decade history, but recently (beginning around 2008-2010) [1] a set of machine learning techniques known as deep learning has continued to evolve and has proven highly effective for various artificial intelligence (AI) tasks in NLP, speech, and computer vision. Deep learning is the other main topic we cover; thus, this book is about NLP and deep learning.

Simply put, deep learning enables one to efficiently learn representations from data using an abstraction called the computational graph, together with numerical optimization techniques. Such is the success of deep learning and computational graphs that large technology companies like Google, Facebook, and Amazon have released implementations of their computational graph frameworks and libraries to capture the mindshare of researchers and engineers.

In this book, we consider PyTorch, an increasingly popular Python-based framework for implementing deep learning algorithms. In this chapter, we explain what computational graphs are and why we chose PyTorch as our framework.

The field of machine learning and deep learning is vast. In this chapter, and for most of this book, we mainly consider what is called supervised learning; that is, learning with labeled training examples. We explain the supervised learning paradigm, which will serve as the foundation for this book. If many of these terms are unfamiliar to you so far, you have come to the right place.

This chapter, along with future chapters, not only clarifies these terms but also examines them in depth. If you are already familiar with some of the terminology and concepts mentioned here, we still encourage you to follow along, for two reasons: to establish a shared vocabulary for the rest of the book, and to fill any gaps needed to understand future chapters.

The goals of this chapter are to:

  • Develop a clear understanding of the supervised learning paradigm, understand the terminology, and develop a conceptual framework to approach the learning tasks of future chapters
  • Learn how to encode inputs for learning tasks
  • Understand what computational graphs are
  • Master the basics of PyTorch

Let's get started !

The Supervised Learning Paradigm

Supervision in machine learning, or supervised learning for short, refers to cases where the ground truth for the targets (what is being predicted) is available for the observations (the inputs).

For example, in document classification, the target is a categorical label, and the observation (input) is a document.

In machine translation, the observation is a sentence in one language, and the target is a sentence in another language. With this understanding of the input data, we illustrate the supervised learning paradigm in Figure 1-1.

[Figure 1-1: The supervised learning paradigm]

We can break the supervised learning paradigm down into six main concepts, as shown in the figure:

  • Observations: Observations are the items about which we want to predict something. We denote an observation by x. Observations are sometimes referred to as "inputs".
  • Targets: Targets are labels corresponding to observations; they are what is usually being predicted. Following standard notation in machine learning/deep learning, we denote these by y. They are sometimes called the ground truth.
  • Model: A model is a mathematical expression or function that takes an observation x and predicts the value of its target label.
  • Parameters: Sometimes called weights, these parameterize the model. Standard notation uses w (for weights) or w_hat.
  • Predictions: Predictions, also called estimates, are the values of the targets guessed by the model, given an observation. We denote these with a "hat"; the prediction of a target y is denoted y_hat.
  • Loss function: A loss function is a function that compares how far a prediction is from its target for observations in the training data. Given a target and its prediction, the loss function assigns a scalar real value called the loss. The lower the loss, the better the model is at predicting the target. We denote the loss function by L.

Although this level of mathematical formality is not strictly required for NLP/deep learning modeling or for writing this book, we will formally restate the supervised learning paradigm in order to equip readers who are new to the area with standard terminology, so that they are familiar with the notation and writing style found in research papers on arXiv.

Consider a dataset D = {X[i], y[i]}, i = 1..n, with n examples. Given this dataset, we want to learn a function (model) f parameterized by weights w. That is, we make an assumption about the structure of f, and given that structure, the learned values of the weights w will fully characterize the model.

For a given input X, the model predicts y_hat as the target: y_hat = f(X; w). In supervised learning, for the training examples, we know the true target y of an observation. The loss for this instance is then L(y, y_hat).

Supervised learning then becomes the process of finding the optimal parameters/weights w that minimize the cumulative loss over all n examples.

Training with (stochastic) gradient descent: The goal of supervised learning is to pick parameter values that minimize the loss function for a given dataset. In other words, this is equivalent to finding the roots of an equation.

Gradient descent is a common technique for finding the roots of an equation. Recall that in traditional gradient descent, we guess some initial values for the roots (parameters) and update the parameters iteratively until the objective function (loss function) evaluates to a value below an acceptable threshold (the convergence criterion).

For large datasets, implementing traditional gradient descent over the entire dataset is usually impossible due to memory constraints, and very slow due to the computational expense. Instead, an approximation of gradient descent called stochastic gradient descent (SGD) is usually employed.

In the stochastic case, a data point or a subset of data points is picked at random, and the gradient is computed for that subset. When a single data point is used, the approach is called pure SGD; when a subset of (more than one) data points is used, it is called minibatch SGD.

Often the words "pure" and "minibatch" are dropped when the variant being used is clear from context. In practice, pure SGD is rarely used, because it results in very slow convergence due to noisy updates. There are different variants of the general SGD algorithm, all aimed at faster convergence. In later chapters, we will explore some of these variants, along with how gradients are used to update the parameters. This iterative process of updating the parameters is called backpropagation. Each step (also called an epoch) of backpropagation consists of a forward pass and a backward pass.

The forward pass evaluates the inputs with the current values of the parameters and computes the loss function.

The backward pass updates the parameters using the gradient of the loss.
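As a quick illustration, the forward/backward loop described above can be sketched in plain Python for a linear model y_hat = w*x + b with a squared-error loss. All data, hyperparameters, and names here are illustrative, not taken from the book:

```python
import random

# Toy dataset generated from y = 3x + 1 (illustrative values only)
data = [(x, 3.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = 0.0, 0.0   # initial guesses for the parameters (weights)
lr = 0.05         # learning rate

random.seed(0)
for step in range(5000):
    batch = random.sample(data, 2)     # random subset -> "minibatch" SGD
    grad_w = grad_b = 0.0
    for x, y in batch:
        y_hat = w * x + b              # forward pass: compute prediction
        grad_w += 2 * (y_hat - y) * x  # d(loss)/dw for squared error
        grad_b += 2 * (y_hat - y)      # d(loss)/db
    w -= lr * grad_w / len(batch)      # backward pass: update parameters
    b -= lr * grad_b / len(batch)      #   using the loss gradient

print(round(w, 2), round(b, 2))  # approaches (3.0, 1.0)
```

Because the toy data is exactly linear, the noisy SGD updates still settle on the true parameters; on real data the updates only approximately minimize the cumulative loss.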

Note that nothing so far has been specific to deep learning or neural networks. The directions of the arrows in Figure 1-1 indicate the direction in which data "flows" while training the system.

We will have more to say about training and the concept of "flow" in "Computational Graphs", but first, let's take a look at how we can numerically represent the inputs and targets in NLP problems so that we can train models and predict outcomes.

Observation and Target Encoding

We need to represent observations (text) numerically in order to use them with machine learning algorithms. Figure 1-2 presents a visual depiction.

[Figure 1-2: Representing text (observations) as numeric vectors]

A simple way to represent text is as a vector of numbers. There are countless ways to perform this mapping/representation. In fact, much of this book is dedicated to learning such representations for a task from data. However, we begin with some simple count-based representations based on heuristics.

Though simple, they are incredibly powerful, and can serve as a starting point for richer representation learning. All of these count-based representations start with a vector of fixed dimension.

One-Hot Representation

As the name suggests, the one-hot representation starts with a zero vector and sets the corresponding entry in the vector to 1 if the word is present in the sentence or document. Consider the following two sentences.

Time flies like an arrow.
Fruit flies like a banana.

Tokenizing the sentences, ignoring punctuation, and treating everything as lowercase yields a vocabulary of size 8: {time, fruit, flies, like, a, an, arrow, banana}. So, we can represent each word with an eight-dimensional one-hot vector. In this book, we use 1[w] to denote the one-hot representation for a token/word w.

The collapsed one-hot representation for a phrase, sentence, or document is simply a logical OR of the one-hot representations of its constituent words. Using the encoding shown in Figure 1-3, the one-hot representation for the phrase "like a banana" would be a 3×8 matrix, where the columns are the eight-dimensional one-hot vectors. It is also common to see a "collapsed" or binary encoding, where the text/phrase is represented by a vector the length of the vocabulary, with 0s and 1s indicating the absence or presence of a word. The binary encoding for "like a banana" is: [0, 0, 0, 1, 1, 0, 0, 1].
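As a quick sanity check, the collapsed binary encoding just described can be sketched in a few lines of Python; the vocabulary order is fixed by hand to match the text:

```python
# Vocabulary from the two example sentences, in the order used in the text
vocab = ['time', 'fruit', 'flies', 'like', 'a', 'an', 'arrow', 'banana']

def binary_one_hot(phrase):
    """Vocabulary-length vector: 1 where a word is present, else 0."""
    words = set(phrase.lower().split())
    return [1 if w in words else 0 for w in vocab]

print(binary_one_hot('like a banana'))  # [0, 0, 0, 1, 1, 0, 0, 1]
```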

[Figure 1-3: One-hot representation for the phrase "like a banana"]

Note: If, at this point, you are troubled that we have conflated two different meanings (or senses) of "flies", congratulations, astute reader! Language is full of ambiguity, but we can still build useful solutions by making drastically simplifying assumptions. It is possible to learn sense-specific representations, but we are getting ahead of ourselves now.

Although we will rarely use anything other than a one-hot representation for the inputs in this book, we will now introduce the Term-Frequency (TF) and Term-Frequency-Inverse-Document-Frequency (TF-IDF) representations, because of their popularity in NLP, for historical reasons, and for the sake of completeness. These representations have a long history in information retrieval (IR) and are actively used even today in production NLP systems.

TF Representation

The TF representation of a phrase, sentence, or document is simply the sum of the one-hot representations of its constituent words. To continue with our silly examples, using the aforementioned one-hot encoding, the sentence "Fruit flies like time flies a fruit" has the following TF representation: [1, 2, 2, 1, 1, 0, 0, 0]. Notice that each entry is a count of the number of times the corresponding word appears in the sentence (corpus). We denote the TF of a word w by TF(w).

Example 1-1: Generating a "collapsed" one-hot or binary representation using sklearn

from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns

corpus = ['Time flies flies like an arrow.',
          'Fruit flies like a banana.']
one_hot_vectorizer = CountVectorizer(binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()
# vocabulary for the axis labels (get_feature_names() in older sklearn)
vocab = one_hot_vectorizer.get_feature_names_out()
sns.heatmap(one_hot, annot=True,
            cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])


A collapsed one-hot is simply a one-hot vector that may contain multiple 1s.

TF-IDF Representation

Consider a collection of patent documents. You would expect most of them to contain words like claim, system, method, procedure, and so on, often repeated multiple times. The TF representation weights words proportionally to their frequency. However, common words such as "claim" do not add anything to our understanding of a specific patent. Conversely, if a rare word such as "tetrafluoroethylene" occurs less frequently but is quite likely to be indicative of the nature of the patent document, we would want to give it a larger weight in our representation. The Inverse-Document-Frequency (IDF) is a heuristic that does exactly that.

IDF penalizes common tokens and rewards rare tokens in the vector representation. The IDF(w) of a token w is defined with respect to a corpus as:

IDF(w) = log(N / n_w)

where n_w is the number of documents containing the word w, and N is the total number of documents. The TF-IDF score is simply the product TF(w) * IDF(w). First, notice that if a word appears in all documents (i.e., n_w = N), then IDF(w) is 0 and the TF-IDF score is 0, thereby completely penalizing that term. Second, if a term appears very rarely (perhaps in only one document), the IDF will be at its maximum value, log N.
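To make the definition concrete, here is a minimal sketch that computes TF-IDF scores directly from the formula above, using the two example sentences as a toy corpus. (Note that sklearn's TfidfVectorizer applies smoothing to the IDF by default, so its numbers will differ slightly.)

```python
import math

# Toy two-document corpus (tokenized versions of the example sentences)
docs = [['time', 'flies', 'like', 'an', 'arrow'],
        ['fruit', 'flies', 'like', 'a', 'banana']]
N = len(docs)  # total number of documents

def tf(word, doc):
    """Number of times `word` appears in `doc`."""
    return doc.count(word)

def idf(word):
    """IDF(w) = log(N / n_w), where n_w counts documents containing w."""
    n_w = sum(1 for doc in docs if word in doc)
    return math.log(N / n_w)

def tf_idf(word, doc):
    return tf(word, doc) * idf(word)

# 'flies' appears in every document, so its IDF (and TF-IDF) is 0
print(tf_idf('flies', docs[0]))             # 0.0
# 'arrow' appears in only one document, so it gets a positive weight
print(round(tf_idf('arrow', docs[0]), 3))   # log(2) ~ 0.693
```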

Example 1-2: Generating a TF-IDF representation using sklearn

from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns

# corpus and vocab are defined in Example 1-1
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(corpus).toarray()
sns.heatmap(tfidf, annot=True, cbar=False, xticklabels=vocab,
            yticklabels=['Sentence 1', 'Sentence 2'])


In deep learning, it is rare to see inputs encoded using heuristic representations like TF-IDF, because the goal is to learn a representation.

Often, we start with a one-hot encoding using integer indices and a special "embedding lookup" layer to construct inputs to the neural network. In later chapters, we present several examples of doing this.

Target Encoding

As noted in "The Supervised Learning Paradigm", the exact nature of the target variable depends on the NLP task being solved. For example, in machine translation, summarization, and question answering, the target is also text and is encoded using approaches such as the one-hot encoding described previously.

Many NLP tasks actually use categorical labels, wherein the model must predict one of a fixed set of labels. A common way to encode this is to use a unique index per label. This simple representation can become problematic when the number of output labels is too large. An example of this is the language modeling problem, in which the task is to predict the next word, given the words seen in the past. The label space is the entire vocabulary of a language, which can easily grow to several hundred thousand entries, including special characters, names, and so on. We revisit this problem and how to address it in later chapters.
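The unique-index encoding mentioned above can be sketched as a pair of lookup tables; the label names here are purely illustrative:

```python
# Hypothetical fixed set of category labels for a classification task
labels = ['sports', 'politics', 'tech']

# Map each label to a unique integer index, and back again
label_to_index = {label: i for i, label in enumerate(labels)}
index_to_label = {i: label for label, i in label_to_index.items()}

print(label_to_index['politics'])  # 1
print(index_to_label[2])           # 'tech'
```

For a language-modeling task, `labels` would be the entire vocabulary, which is exactly why this simple scheme becomes problematic at scale.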

Some NLP problems involve predicting a numerical value from a given text. For example, given an English essay, we might need to assign a numeric grade or a readability score. Given a restaurant review snippet, we might need to predict a numerical star rating up to the first decimal. Given a user's tweets, we might be required to predict the user's age group. Several approaches exist for encoding numerical targets, but simply placing the targets into categorical "bins" (for example, "0-18", "19-25", "25-30", and so on) and treating it as an ordinal classification problem is a reasonable approach. The binning can be uniform or non-uniform and data-driven. Although a detailed discussion of this is beyond the scope of this book, we draw your attention to these issues because target encoding affects performance dramatically in such cases, and we encourage you to see Dougherty et al. (1995) and the references therein.
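As an illustration of binning, a numeric target (say, a user's age) can be mapped to ordinal category bins like so; the bin edges here are hypothetical, not prescribed by the book:

```python
import bisect

# Hypothetical non-overlapping bin edges and labels (illustrative only)
bin_edges = [18, 25, 30]                       # upper bounds of the bins
bin_labels = ['0-18', '19-25', '26-30', '31+']

def age_to_bin(age):
    """Map a numeric age to its ordinal category label."""
    return bin_labels[bisect.bisect_left(bin_edges, age)]

print(age_to_bin(17))  # '0-18'
print(age_to_bin(22))  # '19-25'
print(age_to_bin(40))  # '31+'
```

A data-driven (non-uniform) variant would choose `bin_edges` from quantiles of the training targets instead of fixing them by hand.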

Computational Graphs

Figure 1-1 summarized the supervised learning (training) paradigm as a data flow architecture, in which the model (a mathematical expression) transforms the inputs to obtain predictions, and the loss function (another expression) provides a feedback signal used to adjust the parameters of the model. This data flow can be conveniently implemented using the computational graph data structure. Technically, a computational graph is an abstraction that models mathematical expressions. In the context of deep learning, implementations of the computational graph (such as Theano, TensorFlow, and PyTorch) do additional bookkeeping to implement the automatic differentiation needed to obtain gradients of parameters during training in the supervised learning paradigm. We explore this further in "PyTorch Basics". Inference (or prediction) is simply expression evaluation (a forward flow on the computational graph). Let's see how a computational graph models expressions. Consider the expression: y = wx + b

This can be written as two subexpressions, z = wx and y = z + b. We can then represent the original expression using a directed acyclic graph (DAG), in which the nodes are mathematical operations such as multiplication and addition. The inputs to an operation are the incoming edges of its node, and the output of the operation is the outgoing edge.

So, for the expression y = wx + b, the computational graph is as shown in Figure 1-6. In the next section, we see how PyTorch lets us create computational graphs in a straightforward manner, and how it enables us to compute the gradients without concerning ourselves with any bookkeeping.

[Figure 1-6: The computational graph for y = wx + b]
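As a preview of the next section, here is a minimal sketch of how PyTorch builds this graph and computes gradients automatically; the numeric values are illustrative:

```python
import torch

# requires_grad=True tells PyTorch to do the bookkeeping needed
# for automatic differentiation on these parameters
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)

z = w * x        # first node of the graph: multiplication
y = z + b        # second node of the graph: addition

y.backward()     # traverse the graph backward to compute gradients
print(w.grad)    # dy/dw = x = 3.0
print(b.grad)    # dy/db = 1.0
```

The forward lines build the DAG node by node; `backward()` then walks it in reverse to populate `.grad` on every tensor created with `requires_grad=True`.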



Translator: Yif Du

License: CC BY-NC-ND 4.0

All models are wrong, but some are useful.

This book aims to bring newcomers to natural language processing (NLP) and deep learning up to speed on important topics in both areas. Both subject areas are growing exponentially. As a book that introduces deep learning and emphasizes implementation for NLP, it occupies an important middle ground.

While writing this book, we had to make difficult, sometimes even uncomfortable, choices about what material to leave out. For beginner readers, we hope the book provides a strong grounding in the basics and a glimpse of what is possible. Machine learning, and deep learning in particular, is an empirical discipline, not an intellectual science. The generous end-to-end code examples in each chapter invite you to partake in that experience. When we began working on the book, we started with PyTorch 0.2, and the examples were revised with each PyTorch update from 0.2 to 0.4.

PyTorch 1.0 is due to be released around when this book is published. The code examples in the book are compliant with PyTorch 0.4 and should work as-is with the upcoming PyTorch 1.0 release. A note regarding the style of the book:

We have deliberately avoided mathematics in most places; not because deep learning math is particularly difficult (it is not), but because in many situations it is a distraction from the main goal of this book: to empower the beginner. In many cases, both in code and in text, we have had similar motivations, and we have erred on the side of verbosity over brevity.

Advanced readers and experienced programmers will likely see ways to tighten the code and so on, but our choice has been to be as explicit as possible, so as to reach the broadest audience we want to reach.

Contribution Guide

This project needs proofreaders; you are welcome to submit a Pull Request.

Please be bold in translating and improving your translations. Although we pursue excellence, we do not require you to be perfect, so do not be afraid of making translation mistakes. In most cases, our server keeps a record of all translations, so you need not worry about causing irreparable damage by mistake. (Adapted from Wikipedia)

Contact

Maintainers




docker pull apachecn0/nlp-pytorch-zh
docker run -tid -p <port>:80 apachecn0/nlp-pytorch-zh
# visit http://localhost:{port} to view the documentation


pip install nlp-pytorch-zh
nlp-pytorch-zh <port>
# visit http://localhost:{port} to view the documentation


npm install -g nlp-pytorch-zh
nlp-pytorch-zh <port>
# visit http://localhost:{port} to view the documentation

These excerpts from the book are quoted for learning purposes only, not for commercial use. I recommend that you read the book and learn along!

Keep it up!

Thank you!

Keep striving!

Please include the original link when reprinting. Thanks!