Editor's note ：
Map search 、 recommendation 、 There are a lot of unstructured data hidden in social scenes such as social recommendation , These data are expressed by engineers as high-dimensional vectors with implicit semantics . In order to better deal with the key problem of high-dimensional vector retrieval , Wang Mengzhao, a master of computer science from Hangzhou University of Electronic Science and technology, explored and realized 「 Nearest neighbor graph index with optimal tradeoff between efficiency and accuracy 」, And at the top of the database VLDB 2021 Publish results on .
As a bridge connecting production and scientific research ,Zilliz The research team has been communicating with the academic community 、 Insight into the frontiers of the field . this , Wang Mengzhao came Z star , From the research motivation 、 Algorithm analysis 、 Experimental observation, optimization discussion and other dimensions, and talk about the latest scientific research achievements .
Here is his dry goods sharing , Click here to get the full text of the paper .
High dimensional data retrieval ： Summary of experiments on approximate nearest neighbor search algorithm based on nearest neighbor graph
Vector retrieval is a lot AI Application is an essential basic module . In recent years , Many excellent algorithms based on nearest neighbor graph have been proposed in academia and industry ANNS Algorithm and related optimization , To deal with the retrieval problem of high-dimensional vectors . But for these algorithms , At present, there is a lack of unified experimental evaluation and comparative analysis .
In order to optimize the algorithm design 、 Further industrial application , We finished the paper 《A Comprehensive Survey and Experimental Comparison of Graph-Based Approximate Nearest Neighbor Search》. This paper focuses on the nearest neighbor graph index which realizes the optimal trade-off between efficiency and accuracy , A review is given 13 A representative correlation algorithm , Experiments were performed on rich real-world data sets . Our contribution can be summarized as follows ：
According to the characteristics of the graph index , We will be based on the nearest neighbor graph ANNS The algorithm is divided into four categories , This provides a new perspective for understanding the existing algorithms . On this basis , We describe the inheritance and development relationship between algorithms , So as to sort out the development context of the algorithm ;
We decompose all algorithms into 7 A unified fine-grained component , They constitute a complete algorithm execution process , Thus, the atomic level analysis of the algorithm is realized . We can fairly evaluate the optimization of one component by controlling the consistency of other components ;
We use a variety of data sets （ Include 8 A real-world dataset , They include videos 、 voice 、 High dimensional vector generated by text and image ） And indicators to evaluate the overall performance of the current algorithm ;
We provide recommendations for optimal algorithms in different scenarios 、 Algorithm optimization guidance 、 Algorithm improvement example 、 Analysis and discussion of research trends and challenges .
According to the following observations , We are right. 13 A method based on nearest neighbor graph ANNS The algorithms are compared, analyzed and experimental reviewed ：
at present , Academic and industrial circles put forward 10 More than two common nearest neighbor graph algorithms , However, there is a lack of reasonable classification and comparative analysis of these algorithms . According to our research , The index structure of these algorithms can be summarized as 4 A basic graph structure , But there are many problems with these diagrams , If the complexity is too high . therefore , Here 4 There are many optimizations based on the kind of graph structure , For example, for relative neighborhood graph （RNG） Optimization includes HNSW、DPG、NGT、NSG、SPTAG And so on .
Many existing papers from “ molecular ” The angle evaluation is based on the nearest neighbor graph ANNS Algorithm （ Macro perspective ）. However , We found that , These algorithms have a unified “ Chemical expression ”, They can also be subdivided down into “ atom ”（ Micro angle ）, from “ atom ” Angle analysis can produce some new discoveries , For example, many algorithms use a certain “ atom ”（HNSW,NSG,KGraph,DPG The route is the same ）.
In addition to the tradeoff between search efficiency and accuracy , Nearest neighbor graph based ANNS Other features of the algorithm （ Included in index building and searching ） It also indirectly affects the final search performance . When the search performance gradually reaches the bottleneck , We are concerned about the efficiency of index building 、 Memory consumption and other indicators are paid attention to .
The advantages of an algorithm are not invariable , The characteristics of data sets play a vital role in it . such as , stay Msong（ Voice generated vector data set ） On NSG The acceleration is HNSW Of 125 times ; However , stay Crawl（ Text generated vector data set ） On HNSW The acceleration is NSG Of 80 times . We performed experimental evaluation on a variety of data sets , So as to form a more comprehensive understanding of the algorithm .
Nearest neighbor graph algorithm “ molecular ” Level analysis
In analyzing the algorithm based on nearest neighbor graph ANNS Algorithm before , First of all, let's introduce 4 A classical graph structure , namely ： Delauneto （DG）、 Relative domain diagram （RNG）、K A picture of the neighborhood （KNNG）、 Minimum spanning tree （MST）. Pictured 1 Shown , The difference of the four graph structures is mainly reflected in the edge selection process , A brief summary is as follows ：DG Ensure arbitrary 3 vertices x, y, z Form a triangle xyz There can be no circumscribed circle inside or on the circle except x, y, z Other vertices ;RNG Make sure (b) There must be no other points in the middle crescent area , The crescent shaped areas here are represented by x and y For the center of a circle ,x And y The intersection area of two circles with a distance of radius ;KNNG Each vertex is connected K The nearest neighbor ;MST Under the condition of ensuring connectivity, the length of all edges （ The distance between the two vertices ） The sum is the smallest .
chart 1 The construction results of four graph structures on the same data set
Next , I will be based on the graph 1 Medium 4 A graph structure to sort out 13 A method based on nearest neighbor graph ANNS Algorithm . In order to avoid the misunderstanding caused by translation , The algorithm name is abbreviated in English , Link to the original paper of the algorithm 、 Some high-quality Chinese introductions 、 See resources for some codes . For a more macroscopic connection between the algorithms, please refer to the table in the paper 2 Sum graph 3.
NSW It's right DG Approximation of , and DG It can ensure that accurate results can be obtained through greedy routing from any vertex （ The recall rate is 1 ）.NSW Is a similar to “ Transportation hub ” Undirected graph of , This leads to a sharp increase in the degree of exit of some vertices , From the table of the paper 11 You know ,NSW The maximum output on some data sets can reach hundreds of thousands .NSW Through incremental plug-in construction , This ensures global connectivity , Paper table 4 In the clear ,NSW The number of connected components is 1.NSW It has the property of small world navigation ： Early in the build , The edges formed are far away , It's like one “ Expressway ”, This will improve the efficiency of search ; At the end of the build , The edges formed are closer , This will ensure the accuracy of the search .
Introduction to Chinese ：https://blog.csdn.net/u011233351/article/details/85116719
HNSW stay NSW There are two optimizations based on ：“ hierarchical ” and “ Edge selection strategy ”. The implementation of hierarchy is more intuitive ： Edges with different distance ranges are presented through hierarchies , In this way, a similar jump table structure can be formed during search , More efficient . The optimization principle of edge selection strategy is ： If you want to connect a vertex K If you're a neighbor ,NSW choice K The nearest one , and HNSW From greater than K Select more discrete neighbors from the nearest vertices （ See references 1）. therefore , Considering the edge selection strategy ,HNSW It's right DG and RNG Approximation of .
original text ：https://ieeexplore.ieee.org/abstract/document/8594636
Introduction to Chinese ：https://blog.csdn.net/u011233351/article/details/85116719
FANNG The edge selection strategy and HNSW It's the same , All right. RNG The approximate .FANNG Than HNSW Put forward earlier , But now HNSW Get more general application , The possible reasons are ：
（1）FANNG The edge selection strategy is implemented based on the nearest neighbor graph constructed by violence , Construction efficiency is very low ;
（2）HNSW Build incrementally and introduce a layered strategy , Construction and search efficiency are very high ;
（3）HNSW Open source code ,FANNG There is no .
NGT Is Yahoo Japan's open source vector retrieval library , The core algorithm is based on nearest neighbor graph index .NGT When constructing a nearest neighbor graph, it is similar to NSW, It's also true DG Approximation of , There are some degree adjustments and optimizations in the follow-up , One of the most effective path optimization is also for RNG Approximation of （ The appendix of the paper B It is also proved that ）.
original text 1：https://link.springer.com/chapter/10.1007/978-3-319-46759-7_2
original text 2：https://arxiv.org/abs/1810.07355
SPTAG Is a vector retrieval library released by Microsoft , Its construction process is based on divide and conquer strategy , That is, divide the data set iteratively , Then, the nearest neighbor graph is constructed on each subset , Then merge subgraphs , Finally, the neighborhood propagation strategy is used to further optimize the nearest neighbor graph . The purpose of the above process is to build a KNNG. In search ,SPTAG The scheme of alternating tree index and graph index is adopted , That is, the point close to the query is obtained from the tree as the starting point of the search on the graph to execute the route , When falling into local optimization, continue to obtain the entry point from the tree index , Repeat the above operation until the termination conditions are met .
original text 1：https://dl.acm.org/doi/abs/10.1145/2393347.2393378
original text 2：https://ieeexplore.ieee.org/abstract/document/6247790
original text 3：https://ieeexplore.ieee.org/abstract/document/6549106
Introduction to Chinese 1：https://blog.csdn.net/whenever5225/article/details/108013045
Introduction to Chinese 2：https://cloud.tencent.com/developer/article/1429751
KGraph It's right KNNG Approximation of , It is a construction scheme of nearest neighbor graph for general metric space . Neighbors based on neighbors are more likely to be neighbors' ideas ,KGraph It can quickly build a high-precision KNNG. Many subsequent algorithms （ such as EFANNA、DPG、NSG、NSSG） Are further optimized on the basis of the algorithm .
original text ：https://dl.acm.org/doi/abs/10.1145/1963405.1963487
Introduction to Chinese ：https://blog.csdn.net/whenever5225/article/details/105598694
EFANNA Is based on KGraph The optimization of the . The difference between them is mainly reflected in the initialization of nearest neighbor graph ：KGraph Is to randomly initialize a nearest neighbor graph , and EFANNA It's through KD The tree initializes a more accurate nearest neighbor graph . Besides , In search ,EFANNA adopt KD The tree gets the entry point , and KGraph Random access point .
original text ：https://arxiv.org/abs/1609.07228
Introduction to Chinese ：https://blog.csdn.net/whenever5225/article/details/104527500
analogy EFANNA,IEH Violence constructs an accurate nearest neighbor graph . In search , It gets the entry point through the hash table .
original text ：https://ieeexplore.ieee.org/abstract/document/6734715/
stay KGraph On the basis of ,DPG Consider the neighbor distribution diversity of vertices , Avoid neighbors who are very close to each other repeatedly edge with the target vertex , Maximize the angle between neighbors , The appendix of the paper 4 Proved DPG Our edge selection strategy is also right RNG Approximation of . Besides ,DPG Finally, the reverse edge is added , Is undirected graph , Therefore, its maximum output is also very high （ See the attached table of the paper 11）.
original text ：https://ieeexplore.ieee.org/abstract/document/8681160
Introduction to Chinese ：https://blog.csdn.net/whenever5225/article/details/106176175
NSG Design ideas and DPG It's almost the same . stay KGraph On the basis of ,NSG adopt MRNG The edge selection strategy considers the uniformity of neighbor distribution .NSG My paper will MRNG The edge selection strategy and HNSW The edge selection strategies are compared , Exemplify MRNG The advantages of . Annexes in the paper 1 Proved NSG This edge selection strategy is similar to HNSW Equivalence of edge selection strategy .NSG The entry point is fixed , Is the vertex closest to the global centroid . Besides ,NSG adopt DFS It is mandatory to be reachable from the entry point to all other points .
original text ：http://www.vldb.org/pvldb/vol12/p461-fu.pdf
Introduction to Chinese ：https://zhuanlan.zhihu.com/p/50143204
NSSG Design ideas and NSG、DPG It's almost the same . stay KGraph On the basis of ,NSSG adopt SSG The edge selection strategy considers the diversity of neighbor distribution .NSSG Think ,NSG Excessive trimming （ See the paper table 4）, by comparison SSG The cutting edge should be relaxed .NSG And NSSG Another important difference is ,NSG Obtain candidate neighbors through greedy routing , and NSSG The candidate neighbors are obtained through the first-order expansion of neighbors , therefore ,NSSG More efficient to build .
original text ：https://ieeexplore.ieee.org/abstract/document/9383170
Introduction to Chinese ：https://zhuanlan.zhihu.com/p/100716181
Simply speaking ,Vamana yes KGraph、HNSW and NSG The combination of the three is optimized . On the edge selection strategy ,Vamana stay HNSW （ or NSG） An adjustment parameter is added on the basis of , The edge selection strategy is HNSW Heuristic edge selection , Take different values and perform edge selection twice .
original text ：http://harsha-simhadri.org/pubs/DiskANN19.pdf
Introduction to Chinese ：https://blog.csdn.net/whenever5225/article/details/106863674
HCNNG Is the only one so far to MST Vector retrieval algorithm for basic composition strategy . similar SPTAG,HCNNG The construction process of is based on divide and conquer strategy , Search through KD The tree gets the entry point .
Introduction to Chinese ：https://whenever5225.github.io/2019/08/17/HCNNG/
Nearest neighbor graph algorithm “ atom ” Level analysis We find all nearest neighbor graph based ANNS All algorithms follow a unified processing flow framework , There are several common modules in this process （ Pictured 2 Of C1->C7）. When a nearest neighbor graph algorithm is deconstructed according to this process , We can easily understand how the algorithm is designed and assembled , This will bring great convenience to the subsequent design of nearest neighbor graph vector retrieval algorithm . Besides , We can also fix other modules , Keep other modules exactly the same , The first mock exam is to evaluate the performance differences of different algorithms in a single module .
chart 2 The unified flow framework of nearest neighbor graph vector retrieval algorithm
Next , We use HNSW and NSG For example , Explain how to implement the diagram 2 Process framework analysis algorithm . Before that , We need to be familiar with the index construction and search process of these two algorithms .
First of all HNSW decompose ,HNSW The construction strategy is incremental . therefore , Every time you insert a data point, you have to execute C1->C3 The process .
HNSW Decomposition process ：
|modular||HNSW Concrete realization|
|C1||The largest layer where the new insertion point is generated ; Get search entry point|
|C2||New insertion point as query point , From the entry point , greedy search , Returns a certain number of nearest neighbors of the new insertion point as neighbor candidates|
|C3||Heuristic edge selection strategy|
|C4||No additional steps , The vertex in the top layer serves as the entrance|
|C5||No additional steps , Incremental builds have implicitly ensured connectivity （ Heuristic edge selection destroys connectivity to a certain extent )|
|C6||The top vertex serves as the entrance|
|C7||Best first search|
NSG Decomposition process ：
|modular||NSG Concrete realization|
|C1||NN-Descent Initialize the nearest neighbor graph|
|C2||Vertex as query , Greedy search for neighbor candidates|
|C3||MRNG Edge selection strategy|
|C4||Global centroid as query , Greedy search obtains the nearest vertex as the entry|
|C5||From the entrance ,DFS Ensure connectivity|
|C6||C4 Access to|
|C7||Best first search|
because HNSW Of C3 And NSG Of C3 It is equivalent. , therefore , As can be seen from the above two tables ,HNSW And NSG There is little difference between the two algorithms . Actually , The thesis involves 13 Many of these algorithms are very similar , For details, see Chapter 4 Chapter .
Experimental observation and discussion
For the specific experimental evaluation, please refer to chapter 5 Chapter , Next, we will briefly introduce the observation results and discussion of the experiment ：
Algorithms in different scenarios are recommended
NSG and NSSG Generally, there is a minimum index construction time and index size , therefore , They are suitable for scenarios where a large amount of data is frequently updated ; If we want to quickly build an accurate KNNG（ Not only for Vector Retrieval ）,KGraph、EFANNA and DPG More suitable for ;DPG and HCNNG There is a minimum average search path length , therefore , They fit the needs I/O Scene ; about LID Difficult data sets with high values ,HNSW、NSG、HCNNG More suitable ; about LID Simple data sets with low values ,DPG、NSG、HCNNG and NSSG More suitable for ;NGT There are smaller candidate set sizes , So it applies to GPU Speed up （ in consideration of GPU Memory limit of ）; When memory consumption is high ,NSG and NSSG fit , Because they take up less memory .
Algorithm design wizard
A practical nearest neighbor graph vector retrieval algorithm generally meets the following four aspects ：
High build efficiency
High routing efficiency
High search accuracy
Low memory load
For the first aspect , Let's not improve the index quality of nearest neighbor graph （ That is, the proportion of real nearest neighbors in the neighbors of a vertex ） Spend too much time on . Because the best picture quality （ It can be measured by the proportion of vertices connected to the edge of the nearest vertex in the graph ） Not necessarily achieve the best search performance （ Combined with the table in the paper 4 Sum graph 7、8）.
For the second aspect , We should control the appropriate average output , The distribution of diverse neighbors , Reduce the cost of obtaining entry points , Optimize routing policy , Thus, the number of distance calculations required to converge to the neighborhood of the query point is reduced .
For the third aspect , We should reasonably design the distribution of neighbors , Ensure connectivity , So as to improve the ability to fall into local optimization " Resistance ".
For the fourth aspect , We should optimize neighbor selection strategy and routing strategy , Thus, the output and candidate set size are reduced .
Optimization algorithm example
Based on the above wizard , We assemble a new nearest neighbor graph algorithm （ chart 3 Medium OA）, The algorithm is shown in Fig 2 Medium 7 Each of the components selects a specific implementation of the existing algorithm , namely C1 use KGraph Implementation of algorithm ;C2 use NSSG Implementation of algorithm ;C3 use NSG Implementation of algorithm ;C4 use DPG Implementation of algorithm ;C5 use NSSG Implementation of algorithm ;C6 use DPG Implementation of algorithm ;C7 use HCNNG and NSW Implementation of algorithm .
OA The algorithm achieves the current optimal comprehensive performance , See the original paper for details . therefore , We don't even have to optimize the current algorithm , Just assembling different parts of the existing algorithm can form a new algorithm .
chart 3 OA The search performance of the algorithm is compared with the current optimal nearest neighbor graph algorithm
Trends and challenges
be based on RNG The design of edge selection strategy plays a key role in improving the efficiency of nearest neighbor graph vector retrieval algorithm . In the evaluation of the paper , The only one based on MST The algorithm of HCNNG It also shows good comprehensive performance .
Based on the above pure nearest neighbor graph algorithm , There are three main directions for follow-up development ：
- Hardware optimization ;
- Machine learning optimization ;
- More advanced query requirements , For example, structured and unstructured hybrid queries .
We will face these three challenges in the future ：
- How to combine data coding technology with graph index to solve the problem of high memory consumption of nearest neighbor graph vector retrieval algorithm ;
- Hardware accelerates the construction of nearest neighbor graph index and reduces the construction time of nearest neighbor graph index ;
- According to the characteristics of different scenes, the optimal nearest neighbor graph algorithm is selected adaptively .
About author ：
Wang Mengzhao , Master of computer science, Hangzhou University of Electronic Science and technology . This paper mainly focuses on vector similarity retrieval based on nearest neighbor graph 、 Research contents such as multimodal Retrieval , And applied for three invention patents in relevant directions , At the top of the database VLDB and SCI Area 1 top Periodical KBS Wait to publish two papers .
GitHub @Milvus-io|CSDN @Zilliz Planet|Bilibili @Zilliz-Planet
Zilliz With the vision of redefining Data Science , Committed to building a global leading open source technology innovation company , And unlock the hidden value of unstructured data for enterprises through open source and cloud native solutions .
Zilliz To build the Milvus Vector database , To speed up the development of the next generation of data platform .Milvus The database is LF AI & Data The foundation's Graduation Program , Able to manage a large number of unstructured data sets , In new drug discovery 、 Recommendation system 、 Chat robot has a wide range of applications .