The maximum a posteriori (MAP) estimate in Bayesian inference

This post works through the mathematics of the Bayesian posterior distribution in detail, using a binary classification problem as a running example, and ends with my own understanding of Bayesian inference.

1. The binary classification problem

We are given a data set of \(N\) samples, denoted \(X\). Each sample \(\mathbf{x_n}\) has two attributes, and each sample belongs to one of two classes, labelled \(t\):

\(t \in \left\{0,1\right\}\)

\(\mathbf{x_n}=\begin{pmatrix}x_{n1} \\ x_{n2}\end{pmatrix}\), and suppose the model parameters are \(w=\begin{pmatrix} w_1 \\ w_2\end{pmatrix}\).

\(\mathbf{X}=\begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T\end{bmatrix}\)

[Figure: scatter plot of the sample set]

By Bayes' theorem:

\[p(w|t,X)=\frac {p(t|X,w)\,p(w)} {p(t|X)} \] (Formula 1)

\(p(w|t,X)\) tells us: given the training samples \(X\) and their class labels \(t\) (this is supervised learning, since we already have both the sample set \(X\) and the label \(t\) of every sample in it), what do we know about the model parameters \(w\)? So \(w\) is the unknown quantity, and it has to be inferred from the samples via Bayes' formula. Once we have found the distribution \(p(w|t,X)\), we know the model parameters \(w\).

Once we have the optimal model parameters \(w^*\), then for a new sample \(\mathbf{x_{new}}\) we can compute \[P(T_{new}=1|x_{new}, w^*)\] the probability that \(\mathbf{x_{new}}\) is classified as 1. That probability is the model's prediction.

The right-hand side of Formula 1 has three parts. \(p(t|X,w)\) is the likelihood and \(p(w)\) is the prior; these two are usually straightforward to write down. The hard part is the denominator \(p(t|X)\), the marginal likelihood. Fortunately, the marginal likelihood does not depend on the model parameters \(w\), so the denominator can be treated as a constant with respect to \(w\).

In mathematics, if the prior is conjugate to the likelihood, then the posterior \(p(w|t, X)\) follows the same family of distributions as the prior. For example: if the prior is a Beta distribution and the likelihood is binomial, the two distributions are conjugate, so the posterior is also a Beta distribution.
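As a concrete illustration of conjugacy (separate from the classification problem in this post), here is a minimal sketch of the Beta-Binomial update; the prior parameters and observed counts below are made-up numbers, not from the text.

```python
# Minimal Beta-Binomial conjugacy sketch (illustrative numbers only).
# Prior: Beta(a, b).  Likelihood: k successes out of n Bernoulli trials.
# Conjugacy means the posterior is Beta(a + k, b + n - k); no integration needed.

a, b = 2.0, 2.0      # assumed prior hyperparameters
n, k = 10, 7         # assumed data: 7 successes in 10 trials

post_a, post_b = a + k, b + (n - k)
posterior_mean = post_a / (post_a + post_b)

print(f"Posterior: Beta({post_a}, {post_b}), mean = {posterior_mean:.3f}")
```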

Therefore, when applying Bayes' formula, if the chosen prior distribution is conjugate to the likelihood distribution, the posterior can be computed easily and exactly (giving a precise value of the optimal model parameter \(w^*\)); in other words, we can compute the posterior analytically. In reality, however, the two are usually not conjugate, so three approximate methods are commonly used:

  • Point estimation (point estimate, i.e. the MAP method)
  • The Laplace approximation
  • Sampling (e.g. Metropolis-Hastings)

This post introduces only the point estimation method.

Back to Formula 1. Start with the prior \(p(w)\). The prior plays the role of the experience we already have when making a decision. Since we already have the training samples \(X\), we can plot them and draw their contour lines; they look very close to a Gaussian distribution, so we take the prior to be Gaussian: \[p(w)=N(0,\sigma^2I)\] where \(w\) is a vector and \(I\) is the identity matrix.
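For intuition, a tiny sketch (assuming numpy, and an arbitrary \(\sigma=1.0\)) of what drawing parameter vectors from this prior looks like:

```python
import numpy as np

sigma = 1.0                      # assumed prior standard deviation (hyperparameter)
D = 2                            # w is 2-dimensional, matching the two attributes

# p(w) = N(0, sigma^2 * I): each component is an independent zero-mean Gaussian.
rng = np.random.default_rng(0)
w_samples = rng.normal(loc=0.0, scale=sigma, size=(5, D))
print(w_samples)                 # five plausible parameter vectors under the prior
```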

Next, the likelihood \(p(t|X,w)\). Assume that, given the model parameters \(w\) and the sample set \(X\), the class labels of the samples are conditionally independent of one another, so:

\[p(t|X,w)=\prod_{n=1}^N p(t_n|x_n, w)\] (Formula 2)

For example, with the model parameters \(w\) fixed, \(w\) may predict \(x_1\) as positive, \(x_2\) as negative, ..., \(x_n\) as positive, and these predictions are independent of each other: the prediction \(w\) makes for \(x_1\) does not affect the prediction it makes for \(x_2\).

Because this is a binary problem, \(t_n \in \left\{0,1\right\}\), so Formula 2 can be written further as \[p(t|X,w)=\prod_{n=1}^N p(T_n=t_n|x_n, w)\] where \(T_n\) is the random variable representing the class of sample \(x_n\) and \(t_n\) is the value that random variable takes. In this post's convention, \(t_n=0\) means sample \(x_n\) is in the positive class and \(t_n=1\) means it is in the negative class.

2. The sigmoid function

Because the probability of a random variable taking a certain value lies in \([0,1]\), to model \(p(t|X,w)\) our goal is to find a function \(f(\mathbf{x_n};w)\) whose output is a probability. To simplify the discussion we choose the sigmoid of \(w^T x\), so:

\[P(T_n=1|x_n,w)=\frac{1}{1+\exp(-w^T x_n)}\]

and therefore:

\[P(T_n=0|x_n,w)=1-P(T_n=1|x_n,w)=\frac{\exp(-w^T x_n)}{1+\exp(-w^T x_n)}\]

The two formulas above can be combined into one:

\[P(T_n=t_n|x_n,w)=P(T_n=1|x_n,w)^{t_n}P(T_n=0|x_n,w)^{1-t_n}\]

For the \(N\) samples, Formula 2 can then be written as:

\[p(t|X,w)=\prod_{n=1}^N \left(\frac{1}{1+\exp(-w^T x_n)}\right)^{t_n}\left(\frac{\exp(-w^T x_n)}{1+\exp(-w^T x_n)}\right)^{1-t_n}\] (Formula 3)
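A minimal sketch (assuming numpy, with X as an \(N\times 2\) array and t as a vector of 0/1 labels) of evaluating the logarithm of Formula 3; sigmoid and log_likelihood are illustrative helper names, not from the post.

```python
import numpy as np

def sigmoid(z):
    # P(T_n = 1 | x_n, w) = 1 / (1 + exp(-w^T x_n))
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, t):
    # log of Formula 3: sum_n [ t_n * log P_n + (1 - t_n) * log(1 - P_n) ]
    P = sigmoid(X @ w)
    eps = 1e-12                   # guard against log(0)
    return np.sum(t * np.log(P + eps) + (1 - t) * np.log(1 - P + eps))
```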

Thus the prior is Gaussian and the likelihood is given by Formula 3, so we can now tackle the posterior mentioned above: \[p(w|X,t,\sigma^2)\]

Once we have the posterior, we can use the following formula to compute the probability that a new sample falls in the negative class:

\[P(t_{new}=1|x_{new}, X, t)=E_{p(w|X,t,\sigma^2)}\left[\frac{1}{1+\exp(-w^T x_{new})}\right]\]

To unpack this formula: we have the expression for the posterior \(p(w|X,t,\sigma^2)\), and the quantity inside the expectation is a function of \(w\) (the sigmoid \(f(x_{new};w)\)). Taking the expectation \(E\) of that function under the posterior gives the predicted probability that the new sample \(x_{new}\) has label 1.
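A sketch of how this expectation could be approximated by Monte Carlo if posterior samples of \(w\) were available; the "posterior" samples below are faked with random numbers purely to show the shape of the computation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predictive_prob(x_new, w_samples):
    # Approximates E_{p(w|X,t,sigma^2)}[ sigmoid(w^T x_new) ] by averaging
    # over samples assumed to come from the posterior.
    return np.mean(sigmoid(w_samples @ x_new))

# Illustrative call with made-up "posterior" samples; in the MAP approach of this
# post the expectation is replaced by plugging in the single point estimate w*.
w_samples = np.random.default_rng(1).normal(size=(1000, 2))
x_new = np.array([0.5, -1.0])
print(predictive_prob(x_new, w_samples))
```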

Good. The next step is to solve for the posterior.

3. Solving for the posterior

As stated earlier, the prior is Gaussian, \(N(0,\sigma^2I)\), the likelihood is given by Formula 3, and the denominator, the marginal likelihood, is a constant with respect to \(w\). So define the function \(g(w;X,t,\sigma^2)=p(t|X,w)\,p(w|\sigma^2)\). The function \(g\) is clearly proportional to the posterior \(p(w|X,t,\sigma^2)\). Therefore, finding the maximum of \(g\) is equivalent to finding the optimal parameter \(w^*\) of the posterior.
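A minimal sketch (assuming numpy; sigma is the prior hyperparameter from above) of the unnormalised log posterior \(\log g(w) = \log p(t|X,w) + \log p(w|\sigma^2)\), dropping additive constants that do not involve \(w\):

```python
import numpy as np

def log_g(w, X, t, sigma):
    # log g(w) = log p(t|X,w) + log p(w|sigma^2), up to constants not involving w.
    P = 1.0 / (1.0 + np.exp(-(X @ w)))          # P_n = P(T_n=1 | x_n, w)
    eps = 1e-12
    log_lik = np.sum(t * np.log(P + eps) + (1 - t) * np.log(1 - P + eps))
    log_prior = -0.5 * (w @ w) / sigma**2       # from the N(0, sigma^2 I) prior
    return log_lik + log_prior
```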

The question now is: how do we maximise \(g\)? \(g\) is a function of \(w\); at which value of \(w\) does \(g\) attain its maximum?

Here we use the Newton-Raphson method. Newton's method finds the zeros of a function by iterating the following update:

\[x_{n+1}=x_n-\frac{f(x_n)}{f'(x_n)}\]

Iterating this update repeatedly converges to a point where the function value is 0.
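A self-contained sketch of this scalar Newton-Raphson iteration, applied to the arbitrary example \(f(x)=x^2-2\) (whose zero is \(\sqrt{2}\)); the function and starting point are illustrative choices, not from the post.

```python
def newton_raphson(f, f_prime, x0, n_iters=20, tol=1e-10):
    # Iterate x_{n+1} = x_n - f(x_n) / f'(x_n) until the step is tiny.
    x = x0
    for _ in range(n_iters):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: find the zero of f(x) = x^2 - 2, i.e. sqrt(2).
root = newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
print(root)   # ~1.41421356
```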

To decide whether a function attains an extremum at a point, mathematics gives us the following fact:

Take a differentiable function of one variable, \(h(x)\), as an example. A point where the derivative of \(h(x)\) is 0 is a stationary point, but is it a minimum or a maximum? The second derivative of \(h(x)\) tells us: if \(h'(x_n)=0\) and \(h''(x_n)<0\), then \(h(x)\) attains a maximum at \(x_n\).

Therefore, if we can show that the second derivative of \(g(w;X,t,\sigma^2)\) with respect to \(w\) is negative, we can use Newton's method to find the zero of its first derivative, i.e. the value \(w_0\) such that \(g'(w_0;X,t,\sigma^2)=0\), and this \(w_0\) is the optimal solution \(w^*\).

So let us prove that the second derivative of \(g(w;X,t,\sigma^2)\) with respect to \(w\) is negative. Since \(w\) is a vector, in the multivariate setting this means proving that the Hessian matrix of \(\log g(w;X,t,\sigma^2)\) with respect to \(w\) is negative definite (maximising \(g\) is the same as maximising \(\log g\)).

Take the logarithm of \(g\) and maximise \(\log g(w;X,t,\sigma^2)\):

\[\log g(w;X,t,\sigma^2)=\log\big(p(t|X,w)\,p(w|\sigma^2)\big)=\log p(t|X,w)+\log p(w|\sigma^2)\]

To simplify the formula, adopt the following conventions: assume \(w\) is a \(D\)-dimensional vector, and write \(P_n = P(T_n=1|x_n,w)=\frac{1}{1+\exp(-w^T x_n)}\). Substituting the Gaussian prior and Formula 3 gives:

\[\log g(w;X,t,\sigma^2) = -\frac{D}{2}\log(2\pi) - \frac{D}{2}\log\sigma^2 - \frac{1}{2\sigma^2}w^T w + \sum_{n=1}^N \Big(t_n\log P_n + (1-t_n)\log(1-P_n)\Big)\]

The first three terms are the result of simplifying the Gaussian prior; the first two of them do not involve \(w\). Using the vector derivative formula \[\frac{\partial\, w^T w}{\partial w}=2w\] and the chain rule, the first partial derivative of \(\log g(w;X,t,\sigma^2)\) with respect to \(w\) is:

\[\frac{\partial \log g}{\partial w} = -\frac{1}{\sigma^2}w + \sum_{n=1}^N (t_n - P_n)\,x_n\]

The second partial derivative, the Hessian matrix, is:

\[\frac{\partial^2 \log g}{\partial w\,\partial w^T} = -\frac{1}{\sigma^2}I - \sum_{n=1}^N P_n(1-P_n)\,x_n x_n^T\]

Here \(I\) is the identity matrix and \(0 < P_n < 1\) is a probability, so each \(P_n(1-P_n)\,x_n x_n^T\) is positive semidefinite and \(\frac{1}{\sigma^2}I\) is positive definite; the Hessian is therefore negative definite.

The proof is complete .
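As a quick numerical sanity check (not a substitute for the proof), one could evaluate the Hessian above at an arbitrary \(w\) on made-up data and confirm that all its eigenvalues are negative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # made-up samples with two attributes
w = rng.normal(size=2)                # arbitrary parameter vector
sigma = 1.0                           # assumed prior scale

P = 1.0 / (1.0 + np.exp(-(X @ w)))                              # P_n values
hessian = -(X.T * (P * (1 - P))) @ X - np.eye(2) / sigma**2     # Hessian from above
print(np.linalg.eigvalsh(hessian))    # all eigenvalues should come out negative
```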

So we can safely iterate with Newton's method to find the value of \(w\) at which \(g(w;X,t,\sigma^2)\) attains its maximum, and that value is \(w^*\).
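Putting the pieces together, here is a minimal sketch (assuming numpy; sigma, the starting point and the iteration count are arbitrary choices) of running Newton-Raphson on the gradient and Hessian derived above to obtain the MAP estimate \(w^*\); map_estimate is an illustrative helper name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_estimate(X, t, sigma=1.0, n_iters=25):
    # Newton-Raphson MAP estimate for the Gaussian-prior sigmoid model.
    # X: (N, D) sample matrix, t: (N,) vector of 0/1 labels.
    D = X.shape[1]
    w = np.zeros(D)                               # arbitrary starting point
    for _ in range(n_iters):
        P = sigmoid(X @ w)                        # P_n = P(T_n=1 | x_n, w)
        grad = X.T @ (t - P) - w / sigma**2       # first derivative of log g
        hess = -(X.T * (P * (1 - P))) @ X - np.eye(D) / sigma**2   # Hessian of log g
        w = w - np.linalg.solve(hess, grad)       # Newton step: w - H^{-1} grad
    return w

# Illustrative usage with made-up data (two attributes, as in the post).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
t = (rng.uniform(size=100) < sigmoid(X @ np.array([1.5, -2.0]))).astype(float)
w_star = map_estimate(X, t)
print("w* =", w_star)
```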

Now that we have \(w^*\), the probability that a new sample \(x_{new}\) is predicted negative (\(T_{new}\) takes the value 1) is given by:

\[P(T_{new}=1|x_{new},w^*)=\frac{1}{1+\exp(-w^{*T}x_{new})}\]

Decision boundary

Since this is a binary problem, let's look at what the decision boundary of this Bayesian posterior classifier looks like. The output is a probability, so clearly a sample with \(P(T_{new}=1|x_{new},w^*)>0.5\) is predicted negative and one with \(P(T_{new}=1|x_{new},w^*)<0.5\) is predicted positive. What happens when it equals exactly 0.5?

Setting \[P(T_{new}=1|x_{new},w^*)=\frac{1}{1+\exp(-w^{*T}x_{new})}=0.5\] gives:

\[w^{*T}x=0, \quad\text{i.e.}\quad w_1^*x_1+w_2^*x_2=0\]

\[x_2=-\frac{w_1^*}{w_2^*}\,x_1\]

That is, for a sample \(\mathbf{x}=\begin{pmatrix}x_{1} \\ x_{2}\end{pmatrix}\) on the boundary, the two attributes \(x_1\) and \(x_2\) are related by a straight line, and this line is the decision boundary.
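A small sketch (reusing a hypothetical \(w^*\); in practice it would come from the Newton iteration above) of turning \(w^*\) into the boundary line \(x_2=-\frac{w_1^*}{w_2^*}x_1\):

```python
import numpy as np

# Hypothetical MAP estimate; in practice this comes from the Newton iteration above.
w_star = np.array([1.4, -1.9])

def boundary_x2(x1, w):
    # Points on the decision boundary satisfy w1*x1 + w2*x2 = 0.
    return -(w[0] / w[1]) * x1

x1_grid = np.linspace(-3, 3, 5)
print(list(zip(x1_grid, boundary_x2(x1_grid, w_star))))
```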

Summary

Bayesian methods are common in machine learning. Bayes' formula has three parts: the prior distribution, the likelihood, and the marginal likelihood (the denominator of Bayes' formula). Working out these three parts gives the posterior distribution; then, for a new sample \(x_{new}\), we compute the expected value of the prediction under the posterior, and that expectation is the Bayesian model's prediction.

Computing the posterior depends on the prior and the likelihood. When the two are conjugate, the posterior follows the same distribution family as the prior, so the posterior can be computed analytically. When they are not conjugate, the posterior has to be approximated. There are three common approximation methods: point estimation (MAP), the Laplace approximation, and Metropolis-Hastings sampling. This post mainly covered the first: point estimation (maximum a posteriori).

Where does the "maximum" in maximum a posteriori come in? It lies in maximising \(g(w;X,t,\sigma^2)\), the product of the likelihood and the prior. The negative definiteness of the Hessian matrix shows that \(g(w;X,t,\sigma^2)\) has a maximum, and Newton's method then finds the optimal parameter \(w^*\) at which \(g\) attains that maximum. With \(w^*\) in hand, the prediction formula follows: for a new sample \(x_{new}\), the predicted probability computed at \(w^*\) stands in for the expectation over the posterior, and that value is the Bayesian model's prediction for the new sample.

Reference material

Newton's method: https://zh.wikipedia.org/wiki/牛顿法

Original post: http://www.cnblogs.com/hapjin/p/8834794.html
