In the last few years, research in natural language generation, used for tasks such as text summarization, has made great progress. Yet despite achieving high levels of fluency, neural systems are still prone to hallucination (that is, generating text that is understandable but not faithful to the source), which can prevent them from being used in the many applications that require a high degree of accuracy. Consider an example from the Wikibio dataset, in which a neural baseline model tasked with summarizing the Wikipedia infobox entry for Belgian football player Constant Vanden Stock incorrectly states that he is an American figure skater.
Although assessing the faithfulness of generated text to its source can be challenging, it is usually easier when the source content is structured (for example, in tabular format). Structured data can also test a model's ability to perform logical and numerical reasoning. However, existing large-scale structured datasets are often noisy (that is, the reference sentence cannot be fully inferred from the tabular data), which makes them unreliable for measuring hallucination during model development.
In "ToTTo: A Controlled Table-To-Text Generation Dataset", we present an open-domain table-to-text generation dataset created with a novel annotation process (based on sentence revision), along with a controlled text generation task that can be used to evaluate models. ToTTo (short for "Table-To-Text") contains 121,000 training examples, plus 7,500 examples each for development and test. Because the annotations are accurate, the dataset is suitable as a challenging benchmark for research on high-precision text generation. The dataset and code are open-sourced in our GitHub repository.
ToTTo introduces a controlled generation task: given a Wikipedia table with a set of selected cells as source material, produce a single-sentence description that summarizes the contents of those cells in the context of the table. Examples from the dataset demonstrate some of the many challenges posed by this task, such as numerical reasoning, a large open-domain vocabulary, and varied table structure.
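To make the shape of the task concrete, here is a minimal sketch of how one example might be represented and flattened into model input. The class, its field names, and the `linearize` helper are hypothetical and are not the released dataset schema; the toy values reuse numbers from an example quoted later in this post, while the section title and table layout are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TableToTextExample:
    """One controlled-generation example. Field names are hypothetical, not the released schema."""
    page_title: str
    section_title: str
    table: List[List[str]]                    # row-major grid; row 0 holds the column headers
    highlighted_cells: List[Tuple[int, int]]  # (row, column) indices of the selected cells
    target_sentence: str                      # the single-sentence description to generate


def linearize(example: TableToTextExample) -> str:
    """Flatten titles and highlighted cells into one 'header: value' string."""
    header = example.table[0]
    parts = [f"page: {example.page_title}", f"section: {example.section_title}"]
    for row, col in example.highlighted_cells:
        parts.append(f"{header[col]}: {example.table[row][col]}")
    return " | ".join(parts)


# Toy example; the section title and table layout are invented for illustration.
example = TableToTextExample(
    page_title="Travis Kelce",
    section_title="College statistics",
    table=[
        ["Season", "Receptions", "Yards", "Average", "Touchdowns"],
        ["2012", "45", "722", "16.0", "8"],
    ],
    highlighted_cells=[(1, 0), (1, 1), (1, 2)],
    target_sentence="travis kelce finished the 2012 season with 45 receptions for 722 yards.",
)
print(linearize(example))
```

Only the highlighted cells reach the linearized input, which is what makes the task controlled: the target sentence should describe those cells and nothing else in the table.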
Designing an annotation process that yields natural yet clean target sentences from tabular data is a major challenge. Many datasets (such as Wikibio and RotoWire) pair naturally occurring text with tables, a noisy process that makes it difficult to tell whether hallucination is caused mainly by data noise or by model shortcomings. Alternatively, annotators can be asked to write target sentences from scratch; these are faithful to the table, but the resulting targets often lack variety in structure and style.
By contrast, ToTTo is constructed using a novel data annotation strategy in which annotators revise existing Wikipedia sentences in stages. This yields target sentences that are both clean and natural, with interesting and varied linguistic properties. The data collection and annotation process begins by collecting tables from Wikipedia. Each table is paired with a summary sentence collected from the surrounding page context according to heuristics, such as word overlap between the page text and the table, and hyperlinks that reference tabular data. This summary sentence may contain information that the table does not support, and may contain pronouns whose antecedents appear only in the table, not in the sentence itself.
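The word-overlap heuristic mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the pipeline actually used to build ToTTo: the tokenization, the scoring function, and the 0.3 threshold are all illustrative choices.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(sentence, table):
    """Fraction of distinct sentence tokens that also appear in some table cell."""
    sentence_tokens = tokens(sentence)
    if not sentence_tokens:
        return 0.0
    cell_tokens = set()
    for row in table:
        for cell in row:
            cell_tokens |= tokens(cell)
    return len(sentence_tokens & cell_tokens) / len(sentence_tokens)


def best_candidate(sentences, table, threshold=0.3):
    """Return the highest-overlap sentence, or None if none reaches the threshold.

    The threshold value is an arbitrary assumption for this sketch.
    """
    scored = [(overlap_score(s, table), s) for s in sentences]
    score, sentence = max(scored, default=(0.0, None))
    return sentence if score >= threshold else None


# Toy table and page sentences, invented for illustration.
table = [["Season", "Receptions", "Yards"], ["2012", "45", "722"]]
page_sentences = [
    "The player caught 45 passes for 722 yards in 2012.",
    "The stadium opened several decades earlier.",
]
print(best_candidate(page_sentences, table))
```

A heuristic like this finds sentences that talk about the table, but it cannot guarantee that every phrase is supported by it, which is why the revision stages described next are needed.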
The annotator then highlights the cells in the table that support the sentence and deletes phrases in the sentence that the table does not support. Where necessary, they also decontextualize the sentence so that it stands alone (for example, with correct pronoun resolution) and correct its grammar.
We show that annotators reach high agreement on the above task: a Fleiss' kappa of 0.856 for cell highlighting, and 67.0 BLEU for the final target sentence.
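For readers unfamiliar with Fleiss' kappa, the sketch below implements the standard formula on a toy rating matrix. Treating each table cell as an item with two categories (highlighted or not) is an assumption made for illustration, as is the invented data; the text does not describe how the reported 0.856 was computed.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters assigning item i to category j.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of raters, at least two")

    # Observed agreement: average over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data: 5 table cells, 3 annotators, columns = [highlighted, not highlighted].
ratings = [[3, 0], [0, 3], [3, 0], [2, 1], [0, 3]]
print(round(fleiss_kappa(ratings), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 1 means annotators select almost the same cells far more consistently than random highlighting would.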
We conducted a topic analysis of the ToTTo dataset over 44 categories and found that the Sports and Countries topics, each of which comprises a range of fine-grained topics (for example, football and the Olympics for Sports, and population and buildings for Countries), together account for 56.4% of the dataset. The remaining 44% covers a much broader set of topics, including performing arts, transportation, and entertainment.
We also manually analyzed the different types of linguistic phenomena in the dataset over 100 randomly chosen examples. The table below summarizes the fraction of examples that require reference to the page and section titles, as well as some of the linguistic phenomena that may pose new challenges for current systems.
| Linguistic phenomenon | Percentage |
| --- | --- |
| Requires reference to page title | 82% |
| Requires reference to section title | 19% |
| Requires reference to table description | 3% |
| Reasoning (logical, numerical, temporal, etc.) | 21% |
| Comparison across rows, columns, or cells | 13% |
| Requires background information | 12% |
We present baseline results for three state-of-the-art models from the literature (BERT-to-BERT, Pointer Generator, and the Puduppully 2019 model) on two evaluation metrics, BLEU and PARENT. In addition to reporting scores on the overall test set, we evaluate each model on a more challenging subset consisting of out-of-domain examples. As the table below shows, the BERT-to-BERT model performs best on both BLEU and PARENT. Moreover, all models perform considerably worse on the challenge set, which indicates the difficulty of out-of-domain generalization.
| Model | BLEU (overall) | PARENT (overall) | BLEU (challenge) | PARENT (challenge) |
| --- | --- | --- | --- | --- |
| BERT-to-BERT | 43.9 | 52.6 | 34.8 | 46.7 |
| Pointer Generator | 41.6 | 51.6 | 32.2 | 45.2 |
| Puduppully et al. 2019 | 19.2 | 29.2 | 13.9 | 25.8 |
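To put the gap between the overall and challenge sets in proportion, the following sketch computes the relative drop for each model and metric from the scores in the table above. The `relative_drop` helper and the dictionary layout are illustrative names, not part of any released evaluation code.

```python
# Scores copied from the baseline table: (overall, challenge) per metric.
scores = {
    "BERT-to-BERT": {"BLEU": (43.9, 34.8), "PARENT": (52.6, 46.7)},
    "Pointer Generator": {"BLEU": (41.6, 32.2), "PARENT": (51.6, 45.2)},
    "Puduppully et al. 2019": {"BLEU": (19.2, 13.9), "PARENT": (29.2, 25.8)},
}


def relative_drop(overall, challenge):
    """Percentage of the overall score lost when moving to the challenge set."""
    return 100.0 * (overall - challenge) / overall


for model, metrics in scores.items():
    for metric, (overall, challenge) in metrics.items():
        drop = relative_drop(overall, challenge)
        print(f"{model:24s} {metric:7s} {overall:5.1f} -> {challenge:5.1f}  ({drop:4.1f}% lower)")
```

Every model loses a noticeable share of its score on the out-of-domain subset, and in this table the relative loss is larger for BLEU than for PARENT.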
Although automatic metrics give some indication of performance, they are not currently sufficient for evaluating hallucination in text generation systems. To better understand hallucination, we manually evaluated the best-performing baseline to determine how faithful it is to the content of the source table, under the assumption that discrepancies indicate hallucination. To compute the "Expert" performance, for each example in our multi-reference test set we held out one reference and asked annotators to compare it with the other references for faithfulness. As the results show, the best baseline appears to hallucinate information about 20% of the time.
| Model | Faithfulness (overall) | Faithfulness (challenge) |
| --- | --- | --- |
| Expert | 93.6 | 91.4 |
| BERT-to-BERT | 76.2 | 74.2 |
Model errors and challenges
In the table below, we present a selection of observed model errors to highlight some of the more challenging aspects of the ToTTo dataset. We find that state-of-the-art models struggle with hallucination, numerical reasoning, and rare topics, even when using clean references (errors in red). The last example shows that even when the model output is correct, it is sometimes less informative than the original reference, which contains more reasoning about the table (shown in blue).
| Reference | Model prediction |
| --- | --- |
| in the 1939 currie cup, western province lost to transvaal by 17-6 in cape town. | the first currie cup was played in 1939 in transvaal1 at new- lands, with western province winning 17-6. |
| a second generation of microdrive was announced by ibm in 2000 with increased capacities at 512 mb and 1 gb. | there were 512 microdrive models in 2000: 1 gigabyte. |
| the 1956 grand prix motorcycle racing season consisted of six grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. | the 1956 grand prix motorcycle racing season consisted of eight grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. |
| in travis kelce's last collegiate season, he set personal career highs in receptions (45), receiving yards (722), yards per reception (16.0) and receiving touchdowns (8). | travis kelce finished the 2012 season with 45 receptions for 722 yards (16.0 avg.) and eight touchdowns. |
In this work, we presented ToTTo, a large English table-to-text dataset that offers both a controlled generation task and a data annotation process based on iterative sentence revision. We also provided several state-of-the-art baselines and demonstrated that ToTTo could be a useful dataset for modeling research, as well as for developing evaluation metrics that can better detect model improvements.