1. Introduction

Imagine that you are taken with a sudden desire to understand how the fruit of a tropical tree gets transformed into chocolate bars, or want to understand the role of fever in the human body's immune response: how would you go about finding that information?

If your specific question has already been asked and answered clearly and succinctly on one of the many question answering platforms available on the Internet (such as Quora, Reddit, or Yahoo Answers), you're in luck: modern search engines will probably take you to that pre-existing answer pretty reliably in a matter of a few clicks.

If no one else has asked the exact question you are interested in, however, the process will be a little more involved. You will likely have to collect relevant information from a variety of sources, figure out how these pieces of knowledge fit together in relation to your query, and synthesize a narrative that answers your initial question.

Now, wouldn't it be great if your computer could do all of that for you: gather the right sources (e.g. paragraphs from relevant Wikipedia pages), synthesize the information, and write up an easy-to-read, original summary of the relevant points? Such a system isn't quite available yet, at least not one that can provide reliable information in its summary. Even though current systems excel at finding an extractive span that answers a factoid question in a given document, they still struggle with open-domain settings, where a model needs to find its own sources of information, and with long answer generation.

Thankfully, a number of recent advances in natural language understanding and generation have made working toward solving this problem much easier! These advances include progress in the pre-training (e.g. BART, T5) and evaluation (e.g. for factuality) of sequence-to-sequence models for conditional text generation, new ways to use language understanding models to find information in Wikipedia (e.g. REALM, DPR), and a new training dataset introduced in the paper ELI5: Long Form Question Answering.

The ELI5 dataset was built by gathering questions that were asked by community members of the r/explainlikeimfive subreddit, along with the answers that were provided by other users. The rules of the subreddit make this data particularly well suited to training a model for abstractive question answering: the questions need to seek an objective explanation about well established facts, and the answers provided need to be understandable to a layperson without any particular domain knowledge.

In this notebook, we show how we can take advantage of these recent advances to train a long form question answering system which takes in a question, fetches 10 relevant passages from a Wikipedia snapshot, and writes a multi-sentence answer based on the question and retrieved passages. In particular, training embedding-based retrieval models to gather supporting evidence for open-domain questions is a relatively new research area: the last few months have seen some significant progress in cases where direct supervision is available, or with extensive task-specific pretraining. Here, we show how the ELI5 dataset allows us to train a dense retrieval system without access to either, making dense retrieval models more accessible. See this presentation from the Hugging Face reading group for a non-exhaustive overview of recent work in the field.

Follow along to learn about the steps involved and read some background on the state of the art for some related tasks, or go straight to the:

Live Demo!

(And don't forget to scroll down on the left sidebar to show all of the generation options!)

1.a - Preliminaries

The implementation presented here relies on the Hugging Face 🤗transformers and 🤗nlp libraries. Wikipedia indexing relies on ElasticSearch with its python bindings for the sparse version, and faiss for the dense version. You can get all of these by running:

pip install elasticsearch
pip install faiss_gpu
pip install nlp
pip install transformers
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.1-linux-x86_64.tar.gz
tar -xzvf elasticsearch-7.7.1-linux-x86_64.tar.gz

The training relies on two datasets: ELI5, a processed version of the r/explainlikeimfive subreddit, and the Wiki40b Wikipedia image. You can download both using the 🤗nlp library with:

In [1]:
import nlp
eli5 = nlp.load_dataset('eli5')
wiki40b_snippets = nlp.load_dataset('wiki_snippets', name='wiki40b_en_100_0')['train']

Additionally, all of the useful methods used in this notebook are compiled in the lfqa_utils.py script:

In [2]:
from lfqa_utils import *

1.b - Note on Data and Biases

Before we go any further, let us take a moment to talk about the provenance of our training data. While Reddit hosts a number of thriving communities with high quality discussions, it is also widely known to have corners where sexism, hate, and harassment are significant issues. See for example the recent post from Reddit founder u/spez outlining some of the ways he thinks the website's historical policies have been responsible for this problem, Adrienne Massanari's 2015 article on GamerGate and follow-up works, or a 2019 Wired article on misogyny on Reddit.

While there has been some recent work in the NLP community on de-biasing models (e.g. Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings for word embeddings trained specifically on Reddit data), this problem is far from solved, and the likelihood that a trained model might learn the biases present in the data remains a significant concern.

As mentioned above, the magnitude of the problem depends on the specific communities/subreddits. This work uses data from r/explainlikeimfive, and the nlp library also gives access to examples from r/askscience, and r/AskHistorians. There are some encouraging signs for all of these communities: r/explainlikeimfive and r/askscience have similar structures and purposes, and r/askscience was found in 2015 to show medium supportiveness and very low toxicity when compared to other subreddits (see a hackerfall post, thecut.com write-up and supporting data). Meanwhile, the r/AskHistorians rules mention that the admins will not tolerate "racism, sexism, or any other forms of bigotry".

This is obviously not enough to exonerate the model (the pre-training step, for example, raises its own questions on that topic), and there is still a lot of interesting work to do to be able to quantify the biases in a conditional text generation model. One thing you can do to help: if you find any particularly egregious answers provided by the model when using the demo, or want to collaborate on this research question, please send a DM to @YJernite on Twitter!

2. Task and Data Description

Let's recap: we are interested in the task of Long Form Question Answering. As in other Question Answering tasks, the model is presented with a question, and is required to generate a natural language answer. Whereas a majority of QA datasets contain mostly factoid questions, where the answer, such as a date or the name of a single entity, can be expressed in a few words or single sentence, Long Form QA focuses on questions which call for an explanation consisting of a few sentences or a few paragraphs.

In order to teach a model to answer such questions, we use questions and answers written by Reddit users. Note that the nlp.load_dataset command above actually downloaded questions and their associated answers from the r/explainlikeimfive, r/askscience, and r/AskHistorians subreddits. We focus here on the ELI5/explainlikeimfive part to train the system, as these examples tend to be a little simpler.

Let's look at one item from the test set:

In [3]:
eli5['test_eli5'][12345]
{'q_id': '8houtx',
 'title': 'Why does water heated to room temperature feel colder than the air around it?',
 'selftext': '',
 'document': '',
 'subreddit': 'explainlikeimfive',
 'answers': {'a_id': ['dylcnfk', 'dylcj49'],
  'text': ["Water transfers heat more efficiently than air. When something feels cold it's because heat is being transferred from your skin to whatever you're touching. Since water absorbs the heat more readily than air, it feels colder.",
   "Air isn't as good at transferring heat compared to something like water or steel (sit on a room temperature steel bench vs. a room temperature wooden bench, and the steel one will feel more cold).\n\nWhen you feel cold, what you're feeling is heat being transferred out of you.  If there is no breeze, you feel a certain way.  If there's a breeze, you will get colder faster (because the moving air is pulling the heat away from you), and if you get into water, its quite good at pulling heat from you.   Get out of the water and have a breeze blow on you while you're wet, all of the water starts evaporating, pulling even more heat from you."],
  'score': [5, 2]},
 'title_urls': {'url': []},
 'selftext_urls': {'url': []},
 'answers_urls': {'url': []}}

So here we have the question:

Why does water heated to room temperature feel colder than the air around it?

This definitely requires a multi-step explanation: no single phrase can sum up all of the information we are looking for. Here are the answers that were given on ELI5, and were given scores of +5 and +2 respectively by Reddit users:

  1. Water transfers heat more efficiently than air. When something feels cold it's because heat is being transferred from your skin to whatever you're touching. Since water absorbs the heat more readily than air, it feels colder.

  2. Air isn't as good at transferring heat compared to something like water or steel (sit on a room temperature steel bench vs. a room temperature wooden bench, and the steel one will feel more cold). When you feel cold, what you're feeling is heat being transferred out of you. If there is no breeze, you feel a certain way. If there's a breeze, you will get colder faster (because the moving air is pulling the heat away from you), and if you get into water, its quite good at pulling heat from you. Get out of the water and have a breeze blow on you while you're wet, all of the water starts evaporating, pulling even more heat from you.

First, note that in this case we have two answers which broadly describe the same phenomenon: the first one is scored higher because it is more succinct and to the point. This example already illustrates one important feature of the LFQA task: there are usually several valid ways to answer a given question. Of the 272K examples in the ELI5 training set, nearly two thirds (167K) have at least two answers. We'll need to keep this in mind when training and evaluating the model.

Secondly, we need to give our model access to the information that is expressed in both these answers. Recently released models have been shown to include a significant amount of world knowledge in their parameters without the need for any external knowledge at all (see e.g. the Closed-book QA performance of the T5 model). There are several advantages to giving the model explicit access to information in text form, however. First, storing more knowledge in the parameters calls for a larger model, and a larger number of parameters implies a larger computational cost. Secondly, getting information from a text database allows us to easily update the model's knowledge without having to re-train its parameters.

Overview of the full question answering process.
First, the Document Retriever selects a set of passages from Wikipedia that have information relevant to the question.
Then, the Answer Generation Model reads the concatenation of the question and retrieved passages, and writes out the answer.

Here, we choose to give the model access to Wikipedia text. Full Wikipedia articles are typically too long for most current models to handle, and notable exceptions like the Reformer or Longformer architectures unfortunately do not yet have pre-trained sequence-to-sequence variants. Thus, we follow previous work in splitting Wikipedia articles into disjoint snippets of 100 words, and keep track of the title of the article and sections a snippet came from. Here's how you can get a pre-processed Wiki40b version split into 100-word passages with the nlp library, and an example snippet which has some of the information we're looking for ("little conduction would occur since air is a poor conductor of heat"):

In [4]:
{'_id': '{"nlp_id": 1665419, "wiki_id": "Q179635", "sp": 12, "sc": 653, "ep": 12, "ec": 1223}',
 'nlp_id': 1665419,
 'wiki_id': 'Q179635',
 'start_paragraph': 12,
 'start_character': 653,
 'end_paragraph': 12,
 'end_character': 1223,
 'article_title': 'Heat transfer',
 'section_title': 'Conduction',
 'passage_text': 'from one place to another place without the movement of particles is called conduction, such as when placing a hand on a cold glass of water - heat is conducted from the warm skin to the cold glass, but if the hand is held a few inches from the glass, little conduction would occur since air is a poor conductor of heat. Steady state conduction is an idealized model of conduction that happens when the temperature difference driving the conduction is constant, so that after a time, the spatial distribution of temperatures in the conducting object does not change any'}
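
The splitting step described above can be illustrated with a minimal sketch. The helper name `split_into_passages` and its arguments are hypothetical and are not part of the nlp library or of the notebook's utilities; the real pre-processing also records paragraph and character offsets, which this simplified version leaves out.

```python
def split_into_passages(article_title, section_title, text, words_per_passage=100):
    """Split `text` into disjoint passages of at most `words_per_passage` words.

    Hypothetical helper for illustration only: each passage keeps track of the
    article and section it came from, as in the snippet shown above.
    """
    words = text.split()
    passages = []
    for start in range(0, len(words), words_per_passage):
        passages.append({
            "article_title": article_title,
            "section_title": section_title,
            "passage_text": " ".join(words[start:start + words_per_passage]),
        })
    return passages


# toy article of 250 "words": we expect passages of 100, 100 and 50 words
toy_text = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages("Heat transfer", "Conduction", toy_text)
passage_lengths = [len(p["passage_text"].split()) for p in passages]
```

Because the passages are disjoint, joining their texts back together recovers the original word sequence.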

In the next two sections, we show how we can use either a sparse retriever or a trained dense retriever to automatically find relevant snippets for a question.

3. Sparse Retrieval: Making Support Documents with ElasticSearch

In this section, we show how to use a "classical" Information Retrieval (IR) system based on sparse word matching with ElasticSearch, an extremely popular and efficient search engine that can be used for finding documents that match a given query based on word overlap.

Specifically, ElasticSearch provides a convenient way to index documents so they can easily be queried for nearest neighbor search using the BM25 similarity function (which relies on TF-IDF weighting of words). While this word-matching based approach has obvious limitations, such as failing to take synonyms and sometimes grammatical variation into account, it does pretty well overall and has only recently been overtaken by embedding-based systems for Wikipedia-based Open-Domain QA tasks.
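
To make the BM25 scoring idea more concrete, here is a small pure-Python sketch. It is a simplified illustration, not ElasticSearch's implementation: tokenization is plain whitespace splitting rather than the standard analyzer, the function name `bm25_scores` is hypothetical, and the IDF formula is one common variant. The defaults `k1=1.2` and `b=0.75` are assumed to match the usual BM25 defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in term_freq:
                continue
            # rarer terms get a larger inverse document frequency weight
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # term frequency saturates with k1 and is length-normalized with b
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores


docs = [
    "water conducts heat away from the skin faster than air",
    "the bats roost in the tunnel all year",
    "air is a poor conductor of heat",
]
scores = bm25_scores("water heat air", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

The document that shares the most (and the rarest) query words scores highest, while a document with no overlap scores zero, which is exactly why synonyms and paraphrases are invisible to this kind of retriever.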

In order to use ElasticSearch, you will first need to launch a server. In a different window, run:

./elasticsearch-7.7.1/bin/elasticsearch

By default, your ElasticSearch server will be listening on localhost port 9200. To connect to it run:

In [5]:
es_client = Elasticsearch([{'host': 'localhost', 'port': '9200'}])

The lfqa_utils.py script provides utilities to create (make_es_index_snippets) and query (query_es_index) an ElasticSearch index from within Python.

The main implementation details are:

  1. We index the article title, section title, and text of each of the passages for BM25 matching. These choices are implemented in the index_config variable:
    index_config = {
      "settings": {
        "number_of_shards": 1,
      },
      "mappings": {
        "properties": {
          "article_title": {"type": "text", "analyzer": "standard", "similarity": "BM25"},
          "section_title": {"type": "text", "analyzer": "standard", "similarity": "BM25"},
          "passage_text": {"type": "text", "analyzer": "standard", "similarity": "BM25"}
        }
      }
    }
  2. To query the index, we found it useful to add a few task-dependent stop-words. The query text is then compared to all of the indexed fields, giving more weight to the passage text:
    banned = ['how', 'why', 'what', 'where', 'which', 'do', 'does', 'is', '?', 'eli5', 'eli5:']
    q = ' '.join([w for w in q.split() if w not in banned])
    response = es_client.search(
        index = index_name,
        body = {
            "query": {
                "multi_match": {
                    "query": q,
                    "fields": ["article_title", "section_title", "passage_text^2"],
                    "type": "cross_fields",
                }
            },
            "size": n_results,
        }
    )

Here's the command to create the index; it should take one to three hours depending on your system.

In [6]:
if not es_client.indices.exists('wiki40b_snippets_100w'):
    make_es_index_snippets(es_client, wiki40b_snippets, index_name='wiki40b_snippets_100w')
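
The query construction described in the implementation details above (stop-word filtering followed by a weighted multi_match query) can be sketched as a standalone function that only builds the request body, so it runs without a server. The name `build_es_query` is hypothetical and is not one of the notebook's utilities; lowercasing the question before filtering is an assumption of this sketch, made so that a capitalized "Why" matches the lowercase stop-word list.

```python
BANNED = ['how', 'why', 'what', 'where', 'which', 'do', 'does', 'is', '?', 'eli5', 'eli5:']


def build_es_query(question, n_results=10):
    """Build the body of a multi_match query for a question (hypothetical helper)."""
    # assumption: lowercase first, so that "Why" is filtered like "why"
    q = ' '.join(w for w in question.lower().split() if w not in BANNED)
    return {
        "query": {
            "multi_match": {
                "query": q,
                # "^2" doubles the weight of matches in the passage text
                "fields": ["article_title", "section_title", "passage_text^2"],
                "type": "cross_fields",
            }
        },
        "size": n_results,
    }


body = build_es_query(
    "Why does water heated to room temperature feel colder than the air around it?"
)
cleaned_query = body["query"]["multi_match"]["query"]
```

Note that the filter works on whitespace-separated tokens, so a question mark attached to the last word is kept; only a standalone "?" token is dropped.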

Now let's test the ElasticSearch retriever with our running example ELI5 question about skin-to-water heat transfer by returning the 10 best candidate passages:

In [7]:
question = eli5['test_eli5'][12345]['title']
doc, res_list = query_es_index(question, es_client, index_name='wiki40b_snippets_100w')

df = pd.DataFrame({
    'Article': ['---'] + [res['article_title'] for res in res_list],
    'Sections': ['---'] + [res['section_title'] if res['section_title'].strip() != '' else res['article_title']
                 for res in res_list],
    'Text': ['--- ' + question] + [res['passage_text'] for res in res_list],
})
df.style.set_properties(**{'text-align': 'left'})
Article Sections Text
0 --- --- --- Why does water heated to room temperature feel colder than the air around it?
1 Salt fingering Salt fingering Salt fingering Salt fingering is a mixing process that occurs when relatively warm, salty water overlies relatively colder, fresher water. It is driven by the fact that heated water diffuses more readily than salty water. A small parcel of warm, salty water sinking downwards into a colder, fresher region will lose its heat before losing its salt, making the parcel of water increasingly denser than the water around it and sinking further. Likewise, a small parcel of colder, fresher water will be displaced upwards and gain heat by diffusion from surrounding water, which will then make it lighter than the
2 Solar water heating Flat plate & Evacuated tube protected by a glass panel. Consequently, these types of collectors are much less efficient when water temperature exceeds ambient air temperatures. For pool heating applications, the water to be heated is often colder than the ambient roof temperature, at which point the lack of thermal insulation allows additional heat to be drawn from the surrounding environment. Evacuated tube Evacuated tube collectors (ETC) are a way to reduce the heat loss, inherent in flat plates. Since heat loss due to convection cannot cross a vacuum, it forms an efficient isolation mechanism to keep heat inside the collector pipes. Since two flat
3 Humidifier Fixed-installation humidifiers & Problems to damage from overly dry air. In colder months, they may provide modest energy savings, since as humidity increases, occupants may feel warm at a lower temperature. Bypass humidifiers are connected between the heated and cold air return ducts, using the pressure difference between these ducts to cause some heated air to make a bypass through the humidifier and return to the furnace. Any humidifiers should usually be disabled during the summer months if air conditioning is used; air conditioners partially function to reducing indoor humidity, and having a humidifier continue to operate will waste significant amounts of energy. Problems The USEPA
4 Drake Landing Solar Community How it works & Energy centre the short-term storage tanks in the Energy Centre to be heated again in order to complete the circuit. During colder months the water from the BTES passes back to the short-term storage tank and is then directed to each home. Similar to a hot water tank, the heated water goes through a heat exchanger that blows air across the warm fan coil. Heat travels from the water to the air and is directed through the house via ductwork. When the temperature reaches that said on the thermostat, an automatic valve shuts off the heat transfer unit. Energy centre The Energy
5 Diamond dust Characteristics & Formation it looks like many tiny diamonds are flashing in the air. Formation These ice crystals usually form when a temperature inversion is present at the surface and the warmer air above the ground mixes with the colder air near the surface. Since warmer air frequently contains more water vapor than colder air, this mixing will usually also transport water vapor into the air near the surface, causing the relative humidity of the near-surface air to increase. If the relative humidity increase near the surface is large enough then ice crystals may form. To form diamond dust the temperature must be below
6 Effects of global warming on oceans Ocean currents changing latitudes of our planet. As the atmosphere is warmed nearest the equator, the hot air at the surface of our planet is heated, causing it to rise and draw in cooler air to take its place, creating what is known as circulation cells. This ultimately causes the air to be significantly colder near the poles than at the equator. Wind patterns associated with these circulation cells drive surface currents which push the surface water to the higher latitudes where the air is colder. This cools the water down enough to where it is capable of dissolving more gasses and minerals,
7 Mesoscale convective system Lake-effect snow for lake-effect rain or snow to form, the air moving across the lake must be significantly cooler than the surface air (which is likely to be near the temperature of the water surface). Specifically, the air temperature at the altitude where the air pressure is 850 millibars (or 1.5 kilometres (0.93 mi) altitude) should be 13 °C (24 °F) lower than the temperature of the air at the surface. Lake-effect occurring when the air at 850 millibars is 25 °C (45 °F) colder than the water temperature can produce thundersnow, snow showers accompanied by lightning and thunder (due to the larger amount of energy
8 Thermal comfort Interplay of temperature and humidity such as the heat index. For lower temperatures, a related interplay was identified only qualitatively: High humidity and low temperatures cause the air to feel chilly. Cold air with high relative humidity "feels" colder than dry air of the same temperature because high humidity in cold weather increases the conduction of heat from the body. There has been controversy over why damp cold air feels colder than dry cold air. Some believe it is because when the humidity is high, our skin and clothing become moist and are better conductors of heat, so there is more cooling by conduction. For more recent data look for
9 Honyaki Traditional process heat. The quench water or oil is prepared and brought to the right temperature. The forge is heated. Lights are turned off and the room shut from the outside. Once ready, the blade is buried and shuffled around in the charcoal and when it reaches the correct temperature it is thrust into water and moved forward and back (so as to prevent lateral distortion) and then after a couple seconds side to side. The knife could also be brought up slightly above temperature and then held in the room to the correct temperature before quench. The quench could be interrupted
10 Greywell Tunnel SSSI creates an ideal micro-climate for the bats, which is maintained at around 10 °C (50 °F) all year. When the temperature outside the tunnel is colder than this, cold air flows into the bottom of the tunnel where it is warmed by the water, and warmer air flows out along the top of the tunnel. During the summer the air flow is reversed, with warm air flowing into the top of the tunnel and being cooled as it flows back out over the water. By 2006, there were some 12,500 bats roosting in the tunnel, which included the largest known colony of

We can immediately see both the strengths and limitations of this approach. The system manages to retrieve documents that are all broadly on topic, mentioning some combination of water, air, relative temperature, and temperature transfer. In spite of this, only example 8 ends up containing information that is actually relevant to the question:

Cold air with high relative humidity "feels" colder than dry air of the same temperature because high humidity in cold weather increases the conduction of heat from the body.

We got lucky this time, but this passage could as easily have been ranked 11th and not been included in the support document we provide to the answer generation system. As it is, the model will have to sort through mostly off-topic information to find this sentence when reading the resulting supporting document.

4. Retrieving Support Documents with an ELI5-Trained Dense Model

The sparse retriever works by finding passages which feature the words from the query. However, it has no way to know a priori which of these words are more important in context, and seems to struggle with understanding the central theme of the query (human-perceived temperature).

Thankfully, some recent works have taken advantage of advances in pre-trained contextual word representations to solve this problem. Models such as DPR or REALM for example learn to compute a vector representation of the query, as well as vector representations of Wikipedia passages in such a way that the passages that best answer a question maximize the dot product between the two representations. Retrieval is then reduced to a Maximum Inner Product Search, which can be executed efficiently using systems like FAISS.

These successes are very encouraging for our Open-Domain Long Form QA application. However, our task and setup do not quite meet the requirements of either of these approaches. On the one hand, the DPR system is trained using gold passage annotations: most major QA datasets tell the system which Wikipedia passage contains the answer. Unfortunately, we do not have such annotations for the ELI5 data. On the other hand, while REALM is trained without passage supervision, it requires a pretty expensive pre-training step with an Inverse Cloze Task (100,000 steps with batch size 4096), and the ability to re-compute the embeddings of all Wikipedia passages regularly during training.

In order to train a similar dense retrieval system at reduced cost without having access to gold passage annotation, we will have to take advantage of another unique feature of our dataset, namely the fact that the long form answers are quite similar in style to the Wikipedia passages we want to index. Our hypothesis then is that if we train a system to embed the questions and answers in our dataset in a way that allows us to easily match questions to answers, then using the answer embedder on Wikipedia passages should allow us to similarly match questions to supporting evidence from Wikipedia.

4.a - Contrastive Training with ELI5 In-Batch Negatives

As mentioned above, we want to train a system to produce question and answer embeddings, such that the dot product between the representation of a question and any of its answers is greater than between it and answers of all of the other questions in the dataset.

Unfortunately, actually comparing all questions to all answers before taking every single gradient step is computationally prohibitive: instead, we follow previous work in simply processing medium to large batches of question-answer pairs, and making sure that the dot product of a question with its answer is larger than with all other answers in the batch, and vice versa.

We use a cross-entropy loss for the multinomial distribution over all of the answers (or questions) in a batch, and make use of PyTorch gradient checkpointing to be able to use large batches with limited GPU memory: you can find all implementation details in the RetrievalQAEmbedder class in lfqa_utils.py.

To train the retriever, we show the model batches of 512 question-answer pairs.
The model needs to ensure that the embedding of each question in the batch is closer to the embedding
of its corresponding answer than to the embedding of any other answer in the batch.
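
The in-batch contrastive objective described above can be written out in a few lines. The following is a pure-Python sketch on toy 2-dimensional embeddings rather than the PyTorch code used for training; the function names are hypothetical, and averaging the question-to-answer and answer-to-question losses with equal weight is an assumption of this sketch.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_softmax(row):
    # subtract the max before exponentiating for numerical stability
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def in_batch_contrastive_loss(q_embs, a_embs):
    """Cross-entropy over in-batch candidates, where answer i belongs to question i."""
    n = len(q_embs)
    # scores[i][j] = dot product of question i with answer j
    scores = [[dot(q, a) for a in a_embs] for q in q_embs]
    # question-to-answer: each question must pick its own answer among the batch
    loss_qa = -sum(log_softmax(scores[i])[i] for i in range(n)) / n
    # answer-to-question: same objective on the transposed score matrix
    scores_t = [list(col) for col in zip(*scores)]
    loss_aq = -sum(log_softmax(scores_t[j])[j] for j in range(n)) / n
    return (loss_qa + loss_aq) / 2


questions = [[5.0, 0.0], [0.0, 5.0]]
# answers aligned with their questions: the loss is close to zero
aligned = in_batch_contrastive_loss(questions, [[5.0, 0.0], [0.0, 5.0]])
# answers swapped: every question prefers the wrong answer, so the loss is large
shuffled = in_batch_contrastive_loss(questions, [[0.0, 5.0], [5.0, 0.0]])
```

Every other answer in the batch acts as a negative example for a given question, which is why larger batches give a harder and more informative training signal.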

We use a single BERT-style pre-trained model to embed the questions and answers, and learn different projection matrices to bring both representations down to dimension 128: the projection matrices are trained from scratch as the sentence embedding model is fine-tuned. We found that the 8-layer distilled version of BERT from the Well-Read Students Learn Better paper performed as well as or better than full BERT for a notable gain in computation speed: if you want an even faster model, that work provides pre-trained models spanning the full range of computation/accuracy trade-offs.

The model can then be trained with the following code: with a checkpoint batch size of 32 and a batch size of 512 on a single 16GB GPU, you can run 10 training epochs in under 6 hours.

In [ ]:
# training arguments
class ArgumentsQAR():
    def __init__(self):
        self.batch_size = 512
        self.max_length = 128
        self.checkpoint_batch_size = 32
        self.print_freq = 100
        self.pretrained_model_name = "google/bert_uncased_L-8_H-768_A-12"
        self.model_save_name = "retriever_models/eli5_retriever_model_l-8_h-768_b-512-512"
        self.learning_rate = 2e-4
        self.num_epochs = 10

qar_args = ArgumentsQAR()

# prepare torch Dataset objects
qar_train_dset = ELI5DatasetQARetriver(eli5['train_eli5'], training=True)
qar_valid_dset = ELI5DatasetQARetriver(eli5['validation_eli5'], training=False)

# load pre-trained BERT and make model
qar_tokenizer, qar_model = make_qa_retriever_model(
        model_name=qar_args.pretrained_model_name,
        from_file=None,
        device="cuda:0"
)

# train the model
train_qa_retriever(qar_model, qar_tokenizer, qar_train_dset, qar_valid_dset, qar_args)

If you don't want to train the model yourself, you can also download trained weights from the Hugging Face model repository with:

In [5]:
qar_tokenizer = AutoTokenizer.from_pretrained('yjernite/retribert-base-uncased')
qar_model = AutoModel.from_pretrained('yjernite/retribert-base-uncased').to('cuda:1')
_ = qar_model.eval()

Once the model is trained, it can be used to compute passage embeddings for all Wikipedia snippets. The make_qa_dense_index method takes advantage of numpy memory-mapping, so embeddings are written directly to disk. Again with a single GPU, computing the full set of passage embeddings should take about 18 hours.

In [6]:
if not os.path.isfile('wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat'):
    make_qa_dense_index(
        qar_model, qar_tokenizer, wiki40b_snippets, device='cuda:0',
        index_name='wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat'
    )

4.b - Using the Trained Dense Retriever and Wikipedia Index

Now that we have trained our model to compute query and answer embeddings and used it to compute passage embeddings for all our Wikipedia snippets, let's see whether it can actually find supporting evidence for a new question. Recall the two steps to using the dense retriever: we first compute an embedding for a new question, then do Max Inner Product Search with the pre-computed passage representations.

At test time, the Retriever Model encodes the question, and compares its embedding to the pre-computed representation of
all the Wikipedia passages. The ten passages with the closest embedding are returned to create the support document.
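
As a point of reference, Max Inner Product Search can be written as an exhaustive scan over all passage embeddings. The sketch below is pure Python on toy 2-dimensional vectors, with a hypothetical function name; it is only meant to show what is being computed, and would be far too slow for millions of 128-dimensional passages.

```python
import heapq


def max_inner_product_search(query_emb, passage_embs, k=10):
    """Return the (index, score) pairs of the k passages with the largest dot product."""
    scored = [
        (sum(q * p for q, p in zip(query_emb, emb)), idx)
        for idx, emb in enumerate(passage_embs)
    ]
    # keep the k best scores, highest first
    top = heapq.nlargest(k, scored)
    return [(idx, score) for score, idx in top]


passage_embs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 0.0]]
results = max_inner_product_search([2.0, 1.0], passage_embs, k=2)
```

Here `results` lists passage 0 first (dot product 2.0), followed by passage 1 (dot product 1.5).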

The MIPS part can be executed efficiently with the faiss library. Additionally, since we computed 128-dimensional passage embeddings, the whole of the representations fits on a GPU, making retrieval even faster. We can create the faiss_gpu index with the following code:

In [7]:
faiss_res = faiss.StandardGpuResources()
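wiki40b_passage_reps = np.memmap(
            'wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat',
            dtype='float32', mode='r',
            shape=(wiki40b_snippets.num_rows, 128)
)

wiki40b_index_flat = faiss.IndexFlatIP(128)
wiki40b_gpu_index = faiss.index_cpu_to_gpu(faiss_res, 1, wiki40b_index_flat)
wiki40b_gpu_index.add(wiki40b_passage_reps)  # load the passage embeddings into the index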

Now we can use the query_qa_dense_index function to query the dense index for our running example question about perceived temperature:

In [8]:
question = eli5['test_eli5'][12345]['title']
doc, res_list = query_qa_dense_index(question, qar_model, qar_tokenizer, wiki40b_snippets, wiki40b_gpu_index, device='cuda:1')

df = pd.DataFrame({
    'Article': ['---'] + [res['article_title'] for res in res_list],
    'Sections': ['---'] + [res['section_title'] if res['section_title'].strip() != '' else res['article_title']
                 for res in res_list],
    'Text': ['--- ' + question] + [res['passage_text'] for res in res_list],
})
df.style.set_properties(**{'text-align': 'left'})
Article Sections Text
0 --- --- --- Why does water heated to room temperature feel colder than the air around it?
1 Heat transfer Heat transfer in the human body & Evaporative cooling when the skin is completely wet. The body continuously loses water by evaporation but the most significant amount of heat loss occurs during periods of increased physical activity. Evaporative cooling Evaporative cooling happens when water vapor is added to the surrounding air. The energy needed to evaporate the water is taken from the air in the form of sensible heat and converted into latent heat, while the air remains at a constant enthalpy. Latent heat describes the amount of heat that is needed to evaporate the liquid; this heat comes from the liquid itself and the surrounding gas and surfaces.
2 Johan Sandström Sandström Theorem at greater pressures. There is an ambiguity, however, as to the meaning of the terms 'heating' and 'cooling' in Sandstrom's theorem. So far, heating and cooling has always been interpreted in the literature as being associated with 'surface heating' and 'surface cooling' respectively. In real fluids, however, molecular and turbulent diffusion always cause internal heating/cooling even in absence of external heating/cooling, as long as the temperature of the fluid considered is non-uniform. As is well-known, molecular and turbulent diffusion tends to relax the system toward thermodynamic equilibrium, i.e., toward an isothermal state, which for a statically stable fluid, will warm up
3 Thermal equilibrium Bodies prepared with separately uniform temperatures, then put into purely thermal communication with each other are not in a relation of thermal equilibrium, heat will flow from the hotter to the colder, by whatever pathway, conductive or radiative, is available, and this flow will continue until thermal equilibrium is reached and then they will have the same temperature. One form of thermal equilibrium is radiative exchange equilibrium. Two bodies, each with its own uniform temperature, in solely radiative connection, no matter how far apart, or what partially obstructive, reflective, or refractive, obstacles lie in their path of radiative exchange, not moving relative to one another, will exchange thermal radiation, in net the hotter transferring energy to
4 Evaporative cooler Physical principles air condition and moving along a line of constant enthalpy toward a state of higher humidity. A simple example of natural evaporative cooling is perspiration, or sweat, secreted by the body, evaporation of which cools the body. The amount of heat transfer depends on the evaporation rate, however for each kilogram of water vaporized 2,257 kJ of energy (about 890 BTU per pound of pure water, at 95 °F (35 °C)) are transferred. The evaporation rate depends on the temperature and humidity of the air, which is why sweat accumulates more on humid days, as it does not evaporate fast enough. Vapor-compression refrigeration uses evaporative cooling,
5 Thermal contact conductance Factors influencing contact conductance & Contact pressure Thermal contact conductance In physics, thermal contact conductance is the study of heat conduction between solid bodies in thermal contact. The thermal contact conductance coefficient, , is a property indicating the thermal conductivity, or ability to conduct heat, between two bodies in contact. The inverse of this property is termed thermal contact resistance. Factors influencing contact conductance Thermal contact conductance is a complicated phenomenon, influenced by many factors. Experience shows that the most important ones are as follows: Contact pressure For thermal transport between two contacting bodies, such as particles in a granular medium, the contact pressure is the factor
6 Thermodynamic temperature The heat of phase changes to completely boil or vaporize water (what is known as enthalpy of vaporization) is roughly 540 times that required for a one-degree increase. Water's sizable enthalpy of vaporization is why one's skin can be burned so quickly as steam condenses on it (heading from red to green in Fig. 7 above). In the opposite direction, this is why one's skin feels cool as liquid water on it evaporates (a process that occurs at a sub-ambient wet-bulb temperature that is dependent on relative humidity). Water's highly energetic enthalpy of vaporization is also an important factor underlying why solar pool covers (floating, insulated blankets that
7 Temperature Local thermodynamic equilibrium & Bodies in thermodynamic equilibrium and this is because temperature is an intensive variable. Bodies in thermodynamic equilibrium For experimental physics, hotness means that, when comparing any two given bodies in their respective separate thermodynamic equilibria, any two suitably given empirical thermometers with numerical scale readings will agree as to which is the hotter of the two given bodies, or that they have the same temperature. This does not require the two thermometers to have a linear relation between their numerical scale readings, but it does require that the relation between their numerical readings shall be strictly monotonic. A definite sense of greater hotness can
8 Latent heat Usage phase of atmospheric or ocean water, vaporization, condensation, freezing or melting, whereas sensible heat is energy transferred that is evident in change of the temperature of the atmosphere or ocean, or ice, without those phase changes, though it is associated with changes of pressure and volume. The original usage of the term, as introduced by Black, was applied to systems that were intentionally held at constant temperature. Such usage referred to latent heat of expansion and several other related latent heats. These latent heats are defined independently of the conceptual framework of thermodynamics. When a body is heated at constant temperature
9 Sensible heat Sensible heat Sensible heat Sensible heat is heat exchanged by a body or thermodynamic system in which the exchange of heat changes the temperature of the body or system, and some macroscopic variables of the body or system, but leaves unchanged certain other macroscopic variables of the body or system, such as volume or pressure.
10 Heat transfer Overview & Conduction changes. Conduction On a microscopic scale, heat conduction occurs as hot, rapidly moving or vibrating atoms and molecules interact with neighboring atoms and molecules, transferring some of their energy (heat) to these neighboring particles. In other words, heat is transferred by conduction when adjacent atoms vibrate against one another, or as electrons move from one atom to another. Conduction is the most significant means of heat transfer within a solid or between solid objects in thermal contact. Fluids—especially gases—are less conductive. Thermal contact conductance is the study of heat conduction between solid bodies in contact. The process of heat transfer

The retrieved documents are quite different from the ones returned by the sparse retrieval, with a greater focus on how water helps draw heat from a body, either through evaporation or through better conduction, which is information the model needs to answer this question.

The retriever still misses out on one aspect of the query: the way the question is formulated implies that in the considered scenario the person is immersed in water rather than just wet, which makes the "latent heat" and evaporation arguments a little less relevant, but that's a really subtle distinction!

4.c - Retriever Model Evaluation

We have trained a retrieval model that seems to be working a little better than the traditional word-matching based approach, at least on our running example. Before we use it to actually answer questions, however, we would like a quantitative evaluation of the performance of both approaches.

For the retriever, we want to favor recall over precision: our first priority is to make sure that all of the information needed to write the answers is present in the support document. If there is unrelated information, the generation model can learn to sort it out. We measure this by computing the proportion of words in the high-scoring answers which are present in the retrieved support document. To focus on important words, we also weigh answer words by their Inverse Document Frequency. This gives us the following IDF-recall scoring function:

In [12]:
# We first select high-scoring answers (answers beyond the first must have a score of at least 3)
test_qa_list = [(exple['title'],
                ' '.join([a
                          for i, (a, sc) in enumerate(zip(exple['answers']['text'], exple['answers']['score']))
                          if i == 0 or sc >= 3]))
                for exple in eli5['test_eli5']]

# We then compute word frequencies in answer text
answer_doc_freq = {}
for q, a in test_qa_list:
    for w in a.lower().split():
        answer_doc_freq[w] = answer_doc_freq.get(w, 0) + 1

# The IDF-recall function is then:
def da_idf_recall(doc, answer):
    d_words = dict([(w, True) for w in doc.lower().split()])
    a_words = answer.lower().split()   
    recall = sum([1. / math.log(1 + answer_doc_freq.get(w, 1)) for w in a_words if w in d_words]) / \
                sum([1. / math.log(1 + answer_doc_freq.get(w, 1)) for w in a_words])
    return recall
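
To get a feel for how the IDF weighting behaves, here is a small self-contained sketch on toy data. The frequency table and the helper name `toy_idf_recall` are made up for illustration; the weighting mirrors `da_idf_recall` above, except that the frequency table is passed in explicitly.

```python
import math

# toy answer-word frequencies (hypothetical counts)
toy_doc_freq = {"water": 50, "the": 1000, "conducts": 3, "heat": 20}

def toy_idf_recall(doc, answer, doc_freq):
    """Weighted share of answer words found in doc; rarer words weigh more."""
    d_words = set(doc.lower().split())
    a_words = answer.lower().split()
    weights = [1.0 / math.log(1 + doc_freq.get(w, 1)) for w in a_words]
    matched = sum(wt for w, wt in zip(a_words, weights) if w in d_words)
    return matched / sum(weights)

answer = "water conducts heat"
full = toy_idf_recall("Water conducts heat well", answer, toy_doc_freq)
only_rare = toy_idf_recall("it conducts", answer, toy_doc_freq)    # matches the rare word only
only_common = toy_idf_recall("the water", answer, toy_doc_freq)    # matches a common word only
print(full, only_rare, only_common)
```

A document that only recovers the rare word "conducts" scores higher than one that only recovers the common word "water", which is the behavior we want from the metric.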

The evaluate_retriever function in eli5_utils.py takes a retrieval and scoring function and computes both the average retrieval time and the score of the document relative to the provided answer. Let's write some short-hand functions for the dense and sparse retrievers with our currently loaded indexes, and evaluate them on the ELI5 test set (be advised that evaluating the retriever on the full test set takes up to two hours):

In [14]:
def dense_ret_for_eval(question, n_ret):
    _, dense_res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer, wiki40b_snippets, wiki40b_gpu_index, n_results=n_ret, device='cuda:1'
    )
    dense_doc = ' '.join([res['passage_text'] for res in dense_res_list])
    return dense_doc

def sparse_ret_for_eval(question, n_ret):
    _, sparse_res_list = query_es_index(
        question, es_client, index_name='wiki40b_snippets_100w', n_results=n_ret
    )
    sparse_doc = ' '.join([res['passage_text'] for res in sparse_res_list])
    return sparse_doc

dense_score = evaluate_retriever(test_qa_list, dense_ret_for_eval, da_idf_recall)
sparse_score = evaluate_retriever(test_qa_list, sparse_ret_for_eval, da_idf_recall)

df = pd.DataFrame({
    'IDF-Recall': [sparse_score['idf_recall'], dense_score['idf_recall']],
    'Time/Query': [sparse_score['retrieval_time'], dense_score['retrieval_time']],
}, index=[ 'Sparse', 'Dense'])
df.style.format({'IDF-Recall': "{:.4f}", 'Time/Query': "{:.4f}"})
IDF-Recall Time/Query
Sparse 0.3212 0.3162
Dense 0.3247 0.0948

This metric obviously has limitations. Since it only looks at individual word matches, it is oblivious to word order or paraphrases among others. However, we can be encouraged by the fact that the dense retriever not only yields higher IDF-recall, it also takes less than a third of the time of the ElasticSearch-based system! Considering these results, we can confidently use it for the next part: training the sequence-to-sequence answer generation system.

5. Generating Answers with a Sequence-to-Sequence Model

Now that we know how to create an evidence document with supporting information for a given question, let's look into training the second component of our system: the answer generation module. We will instantiate it as a sequence-to-sequence model which uses the BART architecture, and initialize it with the bart-large pretrained weights.

In short, the BART paper uses a denoising auto-encoder style objective to pre-train an encoder-decoder model (similarly to how masked language modeling is used to pre-train BERT-style encoders). Among other applications, they show that large-scale pre-training with their objective followed by fine-tuning on ELI5 data yields the state-of-the-art ROUGE performance for the original version of the dataset (which uses pre-computed support documents made from CommonCrawl pages).

We provide the concatenation of the question and support document as input to the model, and train the decoder to minimize the perplexity of the gold answer. One notable choice is that we train the model using all high-scoring answers in the training set, so the model will see several instances of the same question-document input with different outputs. The supporting passages are separated by a special token <P>, so the input for our running example will look like:

question: Why does water heated to room temperature feel colder than the air around it? context: <P> when the skin is completely wet. The body continuously loses ... this heat comes from the liquid itself and the surrounding gas and surfaces. <P> protected by a glass panel. Consequently, these types of collectors... Since heat loss due to convection cannot cross a vacuum, it forms an efficient isolation mechanism to keep heat inside the collector pipes. Since two flat <P> ... <P> changes. Conduction On... Fluids—especially gases—are less conductive. Thermal contact conductance is the study of heat conduction between solid bodies in contact. The process of heat transfer
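
As a rough sketch of how such an input string can be assembled: the helper name `make_s2s_input` is hypothetical, and we assume each passage is simply prefixed with the `<P>` separator, as the example above suggests.

```python
def make_s2s_input(question, passages):
    """Prefix every passage with the <P> separator, then prepend the question."""
    support_doc = "<P> " + " <P> ".join(passages)
    return "question: {} context: {}".format(question, support_doc)

# toy passages standing in for retrieved Wikipedia snippets
toy_passages = [
    "when the skin is completely wet.",
    "Fluids are less conductive.",
]
s2s_input = make_s2s_input("Why does water feel colder than air?", toy_passages)
print(s2s_input)
```

The same question/context template is used again at generation time, so the model sees inputs in the format it was trained on.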

The first thing we do is pre-compute the support documents for the training and validation sets so we can use all available GPUs to train the sequence-to-sequence model. The model is then trained with the train_qa_s2s function in eli5_utils.py. A 16GB GPU accommodates up to two examples at a time, so here is the code to train the model using 4 GPUs with torch.nn.DataParallel. One epoch should take about 18 hours:

In [ ]:
# pre-computing support documents
eli5_train_docs = []
for example in eli5['train_eli5']:
    # n_ret (number of passages per support document) must be set beforehand
    support_doc, dense_res_list = query_qa_dense_index(
        example['title'], qar_model, qar_tokenizer, wiki40b_snippets, wiki40b_gpu_index, n_results=n_ret
    )
    eli5_train_docs += [(example['q_id'], support_doc, dense_res_list)]

eli5_valid_docs = []
for example in eli5['validation_eli5']:
    support_doc, dense_res_list = query_qa_dense_index(
        example['title'], qar_model, qar_tokenizer, wiki40b_snippets, wiki40b_gpu_index, n_results=n_ret
    )
    eli5_valid_docs += [(example['q_id'], support_doc, dense_res_list)]

# training loop proper
class ArgumentsS2S():
    def __init__(self):
        self.batch_size = 8
        self.backward_freq = 16
        self.max_length = 1024
        self.print_freq = 100
        self.model_save_name = "seq2seq_models/eli5_bart_model"
        self.learning_rate = 2e-4
        self.num_epochs = 3

s2s_args = ArgumentsS2S()

eli5_train_docs = json.load(open('precomputed/eli5_train_precomputed_dense_docs.json'))
eli5_valid_docs = json.load(open('precomputed/eli5_valid_precomputed_dense_docs.json'))
s2s_train_dset = ELI5DatasetS2S(eli5['train_eli5'], document_cache=dict([(k, d) for k, d, src_ls in eli5_train_docs]))
s2s_valid_dset = ELI5DatasetS2S(eli5['validation_eli5'], document_cache=dict([(k, d) for k, d, src_ls in eli5_valid_docs]), training=False)

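qa_s2s_tokenizer, pre_model = make_qa_s2s_model(
    # NOTE: the arguments are truncated in the source; the values below are
    # assumptions consistent with the bart-large initialization described above
    model_name="facebook/bart-large",
    from_file=None,
    device="cuda:0"
)
qa_s2s_model = torch.nn.DataParallel(pre_model)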

train_qa_s2s(qa_s2s_model, qa_s2s_tokenizer, s2s_train_dset, s2s_valid_dset, s2s_args)

Again, if you don't want to train the model yourself, we made trained weights available on the Hugging Face model repository, which you can download with:

In [9]:
qa_s2s_tokenizer = AutoTokenizer.from_pretrained('yjernite/bart_eli5')
qa_s2s_model = AutoModelForSeq2SeqLM.from_pretrained('yjernite/bart_eli5').to('cuda:0')
_ = qa_s2s_model.eval()

We now have everything we need to answer any question! Now let's try the full system on our running example along with the first four questions of the test set:

In [10]:
questions = []
answers = []

for i in [12345] + [j for j in range(4)]:
    # create support document with the dense index
    question = eli5['test_eli5'][i]['title']
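    doc, res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer,
        wiki40b_snippets, wiki40b_gpu_index, device='cuda:1'
    )
    # concatenate question and support document into BART input
    question_doc = "question: {} context: {}".format(question, doc)
    # generate an answer with beam search
    # NOTE: the generation arguments are truncated in the source; the values
    # below are assumptions, as is keeping the first returned answer with [0]
    answer = qa_s2s_generate(
            question_doc, qa_s2s_model, qa_s2s_tokenizer,
            num_answers=1, num_beams=8,
            min_len=64, max_len=256,
            max_input_length=1024, device="cuda:0"
    )[0]
    questions += [question]
    answers += [answer]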

df = pd.DataFrame({
    'Question': questions,
    'Answer': answers,
})
df.style.set_properties(**{'text-align': 'left'})
Question Answer
0 Why does water heated to room temperature feel colder than the air around it? It doesn't feel colder than the air around it, it feels colder than your body temperature. Water is a better conductor of heat than air, so it takes more energy to heat it up than air is to cool it down. So when you heat water to room temperature, it takes away more energy from your body than air does.
1 Why do you get chills/goosebumps from hearing large crowds sing along to songs? I get goosebumps when I sing along to a song. I don't know what causes it, but it happens to me all the time. I think it has something to do with the fact that when you sing along with a song, your brain releases endorphins, which make you feel good.
2 How did studded leather and heavy eye makeup come to be the Hollywood dress code for dystopian, post-apocalyptic societies? Studded leather and heavy eye makeup have been around for a long time. It's not a new thing, it's been a Hollywood thing since the 80s and 90s. URL_0 > Studded and eye makeup were popularized by Alfred Hitchcock in the 1950s and 1960s, and a number of occurrences of the style in films were mentioned - Charlize Theron in Aeon Flux, Milla Jovovich in the fourth and fifth film of the Fourth and fifth movie of the Star Wars franchise, Angelina Jolie in The Last Airbender, and Raquel Croft in Raiders of the Lost Ark. According to Alfred Hitchcock's To Catch a Thief, there was a trend of women wearing pouty lips, Pouty Lips, pursed lips, etc. and a lot of depictions of the female role of the protagonist in a dystopian, post-apocalyptic society.
3 What's the difference between a bush, a shrub, and a tree? A tree is a living thing. A shrub is a kind of plant. A bush is a type of plant that grows in the ground. A tree has a trunk, a branch, and a trunk. The trunk is the main part of the tree, the branch is the part that grows into the trunk.
4 Why is it hard to breathe with a strong air gust blowing straight at your face? It's not hard to breathe with a strong air gust blowing straight at your face. It's hard to breath when the wind is blowing in the opposite direction. The wind is pushing the air away from your face, so it's harder for your lungs to get the air they need to work against the wind.

We made it, and a lot of these answers actually make sense! The model seems to sometimes struggle with coherence and with starting some of the answers, but we're getting some pretty good information overall.

The last thing we'll do is see how we can get a quantitative evaluation of the model performance. Here, we'll use the ROUGE implementation provided in the nlp library.

Note that it is a different implementation than the one used in the BART and ELI5 papers: the rouge Python package they use normalises all numerical values, among other pre-processing choices, leading to higher numbers. We reproduce their evaluation in the Appendix section, but recommend using the more sensitive metric provided by the nlp package, which can be computed with:

In [20]:
predicted = []
reference = []

# Generate answers for the full test set
for i in range(eli5['test_eli5'].num_rows):
    # create support document with the dense index
    question = eli5['test_eli5'][i]['title']
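    doc, res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer,
        wiki40b_snippets, wiki40b_gpu_index, device='cuda:1'
    )
    # concatenate question and support document into BART input
    question_doc = "question: {} context: {}".format(question, doc)
    # generate an answer with beam search
    # NOTE: the generation arguments are truncated in the source; the values
    # below are assumptions, as is keeping the first returned answer with [0]
    answer = qa_s2s_generate(
            question_doc, qa_s2s_model, qa_s2s_tokenizer,
            num_answers=1, num_beams=8,
            min_len=64, max_len=256,
            max_input_length=1024, device="cuda:0"
    )[0]
    predicted += [answer]
    reference += [eli5['test_eli5'][i]['answers']['text'][0]]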
In [21]:
# Compare each generation to the first answer from the dataset
nlp_rouge = nlp.load_metric('rouge')

scores = nlp_rouge.compute(
    predicted, reference,
    rouge_types=['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],
    use_agregator=True, use_stemmer=False
)
df = pd.DataFrame({
    'rouge1': [scores['rouge1'].mid.precision, scores['rouge1'].mid.recall, scores['rouge1'].mid.fmeasure],
    'rouge2': [scores['rouge2'].mid.precision, scores['rouge2'].mid.recall, scores['rouge2'].mid.fmeasure],
    'rougeL': [scores['rougeL'].mid.precision, scores['rougeL'].mid.recall, scores['rougeL'].mid.fmeasure],
}, index=[ 'P', 'R', 'F'])
df.style.format({'rouge1': "{:.4f}", 'rouge2': "{:.4f}", 'rougeL': "{:.4f}"})
rouge1 rouge2 rougeL
P 0.3025 0.0609 0.1708
R 0.2946 0.0587 0.1797
F 0.2561 0.0504 0.1489
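
To make the P, R and F rows above more concrete, here is a simplified sketch of ROUGE-1 for a single prediction/reference pair. It uses plain whitespace tokenization and no stemming, so it is only an approximation of what the library computes (the library also aggregates scores over the whole test set); the function name `rouge1_prf` is ours.

```python
from collections import Counter

def rouge1_prf(prediction, reference):
    """Unigram-overlap precision, recall and F-measure for one pair of texts."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p, r, f = rouge1_prf(
    "water conducts heat better than air",
    "water is a better conductor of heat than air",
)
print(p, r, f)
```

Five of the six predicted words appear in the nine-word reference, which gives a precision of 5/6 and a recall of 5/9. ROUGE-2 applies the same idea to pairs of consecutive words, and ROUGE-L to the longest common subsequence.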

That's it for today! And once again, if you want to play with the model a bit more and ask it whatever question comes to mind, please feel free to head over to:

Our Live Demo!

Thank you for reading!


Here we reproduce the ROUGE evaluation from the original ELI5 paper to be able to compare our performance to theirs. Our generation setting leads to lower ROUGE-1 and ROUGE-2 than the state-of-the-art reported in BART (30.6 and 6.2 respectively), and higher ROUGE-L (24.3).

In [22]:
from nltk import PorterStemmer
from rouge import Rouge
from spacy.lang.en import English
from time import time

stemmer = PorterStemmer()
rouge = Rouge()
# English().tokenizer is available in both spaCy 2 and 3
# (Defaults.create_tokenizer() was removed in spaCy 3)
tokenizer = English().tokenizer

def compute_rouge_eli5(compare_list):
    preds = [" ".join([stemmer.stem(str(w))
                       for w in tokenizer(pred)])
             for gold, pred in compare_list]
    golds = [" ".join([stemmer.stem(str(w))
                       for w in tokenizer(gold)])
             for gold, pred in compare_list]
    scores = rouge.get_scores(preds, golds, avg=True)
    return scores

compare_list = [(g, p) for p, g in zip(predicted, reference)]
scores = compute_rouge_eli5(compare_list)
df = pd.DataFrame({
    'rouge1': [scores['rouge-1']['p'], scores['rouge-1']['r'], scores['rouge-1']['f']],
    'rouge2': [scores['rouge-2']['p'], scores['rouge-2']['r'], scores['rouge-2']['f']],
    'rougeL': [scores['rouge-l']['p'], scores['rouge-l']['r'], scores['rouge-l']['f']],
}, index=[ 'P', 'R', 'F'])
df.style.format({'rouge1': "{:.4f}", 'rouge2': "{:.4f}", 'rougeL': "{:.4f}"})
rouge1 rouge2 rougeL
P 0.3254 0.0680 0.3251
R 0.3118 0.0631 0.2560
F 0.2729 0.0551 0.2583