Today's paper is on personalized outfit recommendation, published in 2017. It is a relatively early work combining outfit matching with personalized recommendation, using methods based on metric learning and learning-to-rank.
Paper title: FashionNet: Personalized Outfit Recommendation with Deep Neural Network
The paper adds personalization into the mix to realize personalized outfit recommendation. It adopts a deep-learning CNN model: the VGG-16 network is used as the basis to implement the FashionNet model. A FashionNet model consists of two networks, a feature-extraction network and a matching network. For this part the authors designed 3 network structures and ran comparative experiments, and in training they used a two-stage strategy: first train a general outfit-matching model, then add user information and fine-tune the network. The final structure of the network takes two outfits as input: the positive sample comes from the Polyvore training set, i.e. outfits uploaded by users, while the negative sample is a random selection of clothes. Each passes through FashionNet, and then the rank loss is computed.
The problem studied in this paper is similar to metric learning: metric learning needs to learn the distance (or similarity) between objects, while outfit matching needs to learn the compatibility between clothes.
Three network structures are designed, as shown in the figure below.
In the first network structure, the images of the items in an outfit are concatenated along the color channels, giving a w*h*9 input image. This is fed into the VGGNet to extract features, followed by 1 FC (fully connected) layer + softmax that outputs two values, the probabilities of like and dislike;
The characteristic of this network is that it integrates feature learning and match scoring in a single network;
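As a rough sketch (in PyTorch rather than the paper's Caffe, with a tiny stand-in for the VGG-16 backbone and hypothetical layer sizes), FashionNet A could look like this:

```python
import torch
import torch.nn as nn

class FashionNetA(nn.Module):
    """FashionNet A sketch: the three item images are concatenated along
    the color channels into a w*h*9 input, passed through one feature
    extractor, then 1 FC layer + softmax gives like/dislike probabilities."""
    def __init__(self, feat_dim=512):
        super().__init__()
        # Stand-in for VGG-16 with its first conv widened to 9 input channels.
        self.features = nn.Sequential(
            nn.Conv2d(9, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, 2)  # 1 FC layer, 2 outputs

    def forward(self, outfit):  # outfit: (batch, 9, H, W)
        return torch.softmax(self.classifier(self.features(outfit)), dim=1)
```

Feature learning and match scoring live in one network here, which is exactly what the next two architectures move away from.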
In the second network structure, the outfit's item images are not concatenated at the input; instead, each is passed separately through the VGGNet, and every item goes through the same network, i.e. they are all mapped into a common latent space. The extracted features are then concatenated, followed by 3 FC layers + softmax.
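Under the same assumptions (PyTorch, a stand-in extractor, hypothetical layer widths), FashionNet B shares one extractor across items and concatenates the resulting features:

```python
import torch
import torch.nn as nn

class FashionNetB(nn.Module):
    """FashionNet B sketch: every item image goes through the SAME
    weight-shared extractor (a common latent space); the three feature
    vectors are concatenated and scored by 3 FC layers + softmax."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(  # stand-in for the shared VGGNet
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.matcher = nn.Sequential(   # 3 FC layers + softmax head
            nn.Linear(3 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, top, bottom, shoes):  # each: (batch, 3, H, W)
        f = torch.cat([self.features(top), self.features(bottom),
                       self.features(shoes)], dim=1)
        return torch.softmax(self.matcher(f), dim=1)
```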
Problems with the first two networks:
They have difficulty capturing high-order relationships;
They enlarge the data space (the raw-data or feature space), which requires a large number of training samples, but there are not enough of them;
Therefore, in the third network structure, the feature-extraction part is the same as in FashionNet B, but the features of every two items are concatenated and passed through their own matching network (3 FC + softmax). The number of matching networks equals the number of category pairs: different matching networks handle different pairs of clothing types, such as top-shoes and top-bottom. The outputs of the matching networks are summed to give the outfit's final score s.
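Continuing the same sketch, FashionNet C keeps the shared extractor but gives each category pair its own small matching network and sums the per-pair "like" probabilities into the outfit score s:

```python
import torch
import torch.nn as nn

class FashionNetC(nn.Module):
    """FashionNet C sketch: features as in FashionNet B, but each category
    pair (top-bottom, top-shoes, bottom-shoes) has its own matching network
    (3 FC + softmax); the per-pair scores are summed into the final s."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(  # same shared stand-in extractor
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        def pair_matcher():
            return nn.Sequential(
                nn.Linear(2 * feat_dim, 128), nn.ReLU(),
                nn.Linear(128, 32), nn.ReLU(),
                nn.Linear(32, 2),
            )
        self.matchers = nn.ModuleDict({
            "top_bottom": pair_matcher(),
            "top_shoes": pair_matcher(),
            "bottom_shoes": pair_matcher(),
        })

    def forward(self, top, bottom, shoes):
        f = {"top": self.features(top), "bottom": self.features(bottom),
             "shoes": self.features(shoes)}
        score = 0.0
        for pair, matcher in self.matchers.items():
            a, b = pair.split("_")
            probs = torch.softmax(matcher(torch.cat([f[a], f[b]], dim=1)), dim=1)
            score = score + probs[:, 0]  # accumulate the 'like' probability
        return score                     # final outfit score s
```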
Training the network model
The final training architecture is shown in the figure below.
Personalized outfit recommendation is not only a metric-learning problem but also a learning-to-rank problem. The input is a pair of positive and negative samples; each is fed into FashionNet, the network outputs a score s, and the rank loss is then computed from the two scores. The formula is shown in the figure below:
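Since the loss itself only appears in a figure, here is one common hinge-style formulation of a pairwise rank loss as an assumption of what it might look like: the positive outfit's score should exceed the negative one's by a margin:

```python
import torch

def rank_loss(s_pos, s_neg, margin=1.0):
    """Hinge-style pairwise rank loss (a standard formulation; the exact
    formula in the paper's figure may differ). Zero when the positive
    outfit outscores the negative one by at least `margin`."""
    return torch.clamp(margin - s_pos + s_neg, min=0.0).mean()
```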
The reasons for adopting the two-stage training strategy are:
Each user has relatively few outfits, too few to train a neural network with good performance;
The outfits of many users are quite similar to one another;
Although users' matching tastes are not exactly the same, there are still many common matching patterns, for example shirts usually go with jeans.
For the above reasons, the two-stage training strategy is chosen.
The first stage trains a general outfit-matching model. In this step, user information is discarded and all outfits are pooled together to train the network. A training sample is a pair of a positive outfit and a negative outfit. The VGGNet is initialized with parameters pre-trained on ImageNet; the other layers are initialized from a Gaussian distribution;
The second stage trains a user-specific model for personalized recommendation. It is initialized with the network parameters trained in the first stage, and then fine-tuned with each user's own data.
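The two stages can be sketched framework-agnostically; `train_fn`, the data arguments, and the per-user deep copy here are illustrative assumptions, not the paper's code:

```python
import copy

def two_stage_training(model, pooled_data, per_user_data, train_fn):
    """Stage 1: train one general matching model on all users' outfits
    pooled together. Stage 2: for each user, copy the stage-1 weights
    and fine-tune on that user's own data, giving one model per user."""
    train_fn(model, pooled_data)                  # stage 1: general model
    user_models = {}
    for user_id, data in per_user_data.items():   # stage 2: user-specific
        user_model = copy.deepcopy(model)         # init from stage-1 weights
        train_fn(user_model, data)
        user_models[user_id] = user_model
    return user_models

# Tiny demo with a stand-in "training" step that only records the data seen.
seen = []
general = {"weights": 0}
user_models = two_stage_training(
    general, "all_outfits",
    {"alice": "alice_outfits", "bob": "bob_outfits"},
    lambda model, data: seen.append(data))
```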
For the 3 network structures above, the second-stage fine-tuning is set up as follows:
For FashionNet A, the whole network (feature extraction and matching) is fine-tuned;
For networks B and C, there are two strategies:
Fine-tune the whole network: both the feature-extraction network and the matching network get personalized parameters, i.e. for different users the same piece of clothing will have different features;
Fix the feature-extraction network and fine-tune only the matching network. This speeds up training, and in practical applications this is the approach more often taken.
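The "partial" strategy amounts to freezing the extractor's parameters; a minimal PyTorch sketch with a stand-in two-part model:

```python
import torch.nn as nn

class TinyFashionNet(nn.Module):
    """Stand-in with the same two-part layout as FashionNet B/C."""
    def __init__(self):
        super().__init__()
        self.features = nn.Linear(4, 4)  # stand-in feature extractor
        self.matcher = nn.Linear(4, 2)   # stand-in matching network

def freeze_feature_extractor(model):
    """Stage-2 'partial' fine-tuning: stop gradients to the shared
    feature-extraction layers so only the matching network is updated."""
    for p in model.features.parameters():
        p.requires_grad = False

net = TinyFashionNet()
freeze_feature_extractor(net)
```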
The parameter settings of the three network structures are as follows:
The Polyvore dataset is used, containing outfits uploaded by 800 users. Each outfit consists of three items, a top, a bottom and shoes. Polyvore outfits are taken as positive samples; negative samples are randomly selected tops, bottoms and shoes.
The dataset is split into training, validation and test sets. In each set, the numbers of positive samples per user are 202, 46 and 62 respectively; negative samples are 6 times the positives, i.e. 1212, 276 and 372.
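A negative sample can be built by recombining items at random across outfits; the dict layout here is an illustrative assumption:

```python
import random

def sample_negative(outfits, rng=random):
    """Build a negative sample by picking the top, bottom and shoes
    independently at random from the pool of user-uploaded outfits."""
    return {
        "top": rng.choice(outfits)["top"],
        "bottom": rng.choice(outfits)["bottom"],
        "shoes": rng.choice(outfits)["shoes"],
    }
```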
NDCG is used as the evaluation criterion; it evaluates a ranked list. The NDCG at the m-th position is:

NDCG@m = (1/Z_m) * Σ_{i=1}^{m} (2^{r_i} − 1) / log2(i + 1)

Z_m represents the score of the ideal ranking; the relevance r_i is 1 for positive samples and 0 for negative samples;
The optimal value of NDCG is 1;
Mean NDCG is the average of NDCG@m over m = 1, …, M (M is the length of the ranked list);
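These definitions can be sketched directly; note that with 0/1 relevance the two common gain variants, r and 2^r − 1, coincide:

```python
import math

def ndcg_at_m(relevance, m):
    """NDCG@m for a ranked list of 0/1 relevance labels (1 = positive
    sample, 0 = negative). The normalizer is the DCG of the ideal
    ranking, so the optimal value is 1."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:m]))
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

def mean_ndcg(relevance):
    """Average of NDCG@m over m = 1..M, M being the list length."""
    M = len(relevance)
    return sum(ndcg_at_m(relevance, m) for m in range(1, M + 1)) / M
```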
The metrics finally adopted:
mean NDCG: the mean NDCG of all users, i.e. first compute NDCG at the different values of m and average them, then divide by the number of users;
average of NDCG@m: the NDCG@m of all users, the mean over users for each value of m (just divide by the number of users);
the number of positive samples in the top-k results (for specific values of k).
The framework used is Caffe, the batch size is 30, and the number of epochs is 18.
The learning-rate schedule follows the paper "Return of the devil in the details: Delving deep into convolutional nets".
The experimental results are shown in the figure below:
In the figure above:
Initial is the result with the initialization parameters, i.e. of the pre-trained model;
Stage one is the model after the first training stage;
Stage two (partial) and Stage two (whole) are the two second-stage fine-tuning strategies: the former fixes the feature-extraction network and fine-tunes only the matching network, while the latter fine-tunes the whole network;
Stage two (direct) skips the first stage and fine-tunes the network directly, i.e. a model trained directly on each user's data.
From the comparison of the experimental results, four conclusions can be drawn:
FashionNet A vs. B and C: the former puts feature extraction and match computation in one network, while the latter two separate these functions into different networks. The experiments show that using different networks for different functions performs better;
FashionNet B vs. C: the two differ in the design of the matching network. The former has only one matching network, while the latter concatenates the features of every two clothing categories and feeds them into a separate matching network. The latter performs better, which shows that decomposing the high-order relationship into a series of pairwise combinations is the better solution;
Two-stage training strategy: comparing the results of stage one and stage two, the latter works better, which shows that fine-tuning is a very useful technique;
Stage two (partial) vs. Stage two (whole): the latter performs better, showing that learning user-specific feature representations is very helpful for the recommendation task. However, this strategy takes longer to train, since the features of all clothes must be recomputed; therefore, in practical applications the slightly worse first strategy is adopted, fine-tuning only the matching network.
The idea of this paper is to combine metric learning and learning-to-rank, applying a CNN for feature extraction and then computing the compatibility between clothes; the training strategy likewise goes from a general matching model to personalized matching models.
However, I still have a small question: according to the paper, the second stage trains the personalized model using each user's outfits, so in the end each user gets their own model. Does that mean that the more users there are, the more models have to be trained?
Of course, this may just be a gap in my understanding; if you have read this paper, you are welcome to discuss it with me.
Feel free to leave a message and discuss with me!