Diagnostics documentation

What is a diagnostics dataset?

The SuperGLUE project begins with a hand-crafted evaluation set covering a broad range of linguistic phenomena: the Diagnostic Dataset. It evaluates sentence understanding through the natural language inference (NLI) problem.

The NLI task is well suited for diagnosing models because it can encompass a wide range of skills involved in general language understanding, from resolving syntactic ambiguity to high-level reasoning, while supporting a straightforward binary evaluation.

In this notebook, we will examine the components of the diagnostic dataset and compare baseline model performance on the Russian (translated) data and the original English dataset.

The dataset is fully compatible with original English Diagnostic dataset: https://super.gluebenchmark.com/diagnostics

The data consists of several hundred sentence pairs labeled with their entailment relations (entailment or not_entailment) in both directions and tagged with a set of linguistic phenomena involved in justifying the entailment labels.

It was constructed manually by the authors of GLUE and draws text from several different sources, including news, academic and encyclopedic text, and social media. The datasets presented here are a direct analogue of the original SuperGLUE diagnostics.

The sentence pairs were crafted so that each sentence in a pair is very similar to the other, to make the problem harder for systems that rely on simple lexical cues and statistics.

Linguistic phenomena are tagged with both coarse- and fine-grained category labels. The coarse-grained categories are Lexical Semantics, Predicate-Argument Structure, Logic, and Knowledge and Common Sense. Each of these has several fine-grained subcategories for specific linguistic phenomena of that kind. See the Linguistic Categorization section for details on the meanings of all of the categories and how they are annotated.

Contents

  1. Data Format
  2. Exploring English and Russian data
  3. Meaning of the categories:
    • Entailment
    • Lexical semantics
    • Logic
    • Predicate-argument structure
    • Knowledge
  4. Basic Model Performance

Data Format

Each example in the diagnostic dataset has the same structure:

{
 'sentence1': "The cat sat on the mat.",
 'sentence2': "The cat did not sit on the mat.",
 'label': 'not_entailment',
 'knowledge': '',
 'lexical-semantics': '',
 'logic': 'Negation',
 'predicate-argument-structure': ''
}

where

  1. "sentence1" and "sentence2" - the two sentences to be compared
  2. "label" - the entailment label: whether there is an entailment relation between the two sentences
  3. "knowledge" - tags marking common sense and world knowledge in the examples
  4. "lexical-semantics" - tags for semantic features that are expressed lexically: types of negation, quantifiers, factivity, etc.
  5. "logic" - tags for natural language logic operators: conditionals, temporality, disjunction, etc.
  6. "predicate-argument-structure" - morphosyntactic features of the example sentences concerning predicate-argument structure: verb voice, syntactic alternations, etc.

See below to explore all the tag variety and linguistic specification.

In this example notebook the datasets are also enriched with two additional fields:

  1. 'prediction': 'entailment'|'not_entailment' - the model's prediction: whether it predicts an entailment relation between the two sentences
  2. 'equal': True|False - whether the model prediction matches the gold label

These fields are added for the convenience of the analysis below - feel free to reuse them when evaluating your own model [Link to the code]
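As a rough illustration of how these two fields can be produced (this is not the notebook's code; predict_entailment and the file name 'AX-b.csv' are hypothetical placeholders for your own model and for a diagnostic file without predictions):

import pandas as pd

def predict_entailment(sentence1, sentence2):
    # hypothetical stand-in: replace with a call to your own NLI model
    return 'not_entailment'

# 'AX-b.csv' is a placeholder name for the diagnostic file without predictions
df = pd.read_csv('AX-b.csv', index_col='idx').fillna('')
df['prediction'] = [predict_entailment(s1, s2)
                    for s1, s2 in zip(df['sentence1'], df['sentence2'])]
df['equal'] = df['prediction'] == df['label']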

Exploring English and Russian data

imports

In [20]:
import pandas as pd
import warnings
warnings.simplefilter('ignore')


%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
# plots rendered as svg look sharper
%config InlineBackend.figure_format = 'svg'

# increase the default figure size
from pylab import rcParams
rcParams['figure.figsize'] = 8, 5

Comparison of SuperGLUE results on the diagnostic dataset

English Model - bert-large-cased, trained on RTE.

Russian Model - DeepPavlov/rubert-base-cased-conversational, trained on Russian RTE (short version of 2000 examples).

In [21]:
rus_df = pd.read_csv('AX-b-rus-pred.csv', index_col = 'idx').fillna('')
en_df = pd.read_csv('AX-b-en-pred.csv', index_col = 'idx').fillna('')
print('Russian accuracy: ', round(rus_df.equal.sum()/rus_df.shape[0],3))
print('English accuracy: ', round(en_df.equal.sum()/en_df.shape[0],3))
print('\nEntailment percent (label)', round(rus_df.label.value_counts()['entailment']/rus_df.shape[0],3))
print('Entailment percent (Russian)', round(rus_df.prediction.value_counts()['entailment']/rus_df.shape[0],3))
print('Entailment percent (English)', round(en_df.prediction.value_counts()['entailment']/en_df.shape[0],3))
Russian accuracy:  0.589
English accuracy:  0.539

Entailment percent (label) 0.416
Entailment percent (Russian) 0.381
Entailment percent (English) 0.784
In [22]:
en_df.head()
Out[22]:
knowledge label lexical-semantics logic predicate-argument-structure sentence1 sentence2 prediction equal
idx
0 not_entailment Negation The cat sat on the mat. The cat did not sit on the mat. entailment False
1 not_entailment Negation The cat did not sit on the mat. The cat sat on the mat. entailment False
2 not_entailment Negation When you've got no snow, it's really hard to l... When you've got snow, it's really hard to lear... entailment False
3 not_entailment Negation When you've got snow, it's really hard to lear... When you've got no snow, it's really hard to l... entailment False
4 not_entailment Negation Out of the box, Ouya supports media apps such ... Out of the box, Ouya doesn't support media app... entailment False
In [23]:
rus_df.head()
Out[23]:
knowledge label lexical-semantics logic predicate-argument-structure sentence1 sentence2 prediction equal
idx
0 not_entailment Negation Кошка сидела на коврике. Кошка не сидела на коврике. not_entailment True
1 not_entailment Negation Кошка не сидела на коврике. Кошка сидела на коврике. not_entailment True
2 not_entailment Negation Когда у вас нет снега, очень сложно обучиться ... Когда у вас есть снег, очень сложно обучиться ... not_entailment True
3 not_entailment Negation Когда у вас есть снег, очень сложно обучиться ... Когда у вас нет снега, очень сложно обучиться ... not_entailment True
4 not_entailment Negation Сразу же после распаковки Ouya поддерживает му... Сразу же после распаковки Ouya не поддерживает... not_entailment True
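Beyond looking at the first rows, individual phenomena can be pulled out by filtering on the tag columns. A minimal sketch (tag values follow the value_counts listings further below; tags can be combined with ';', hence the substring match):

# all English examples tagged with 'Negation' in the logic column
negation_en = en_df[en_df['logic'].str.contains('Negation')]
print(negation_en[['sentence1', 'sentence2', 'label', 'prediction']].head())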

Meaning of the categories

with examples

Entailment

= Natural Language Inference

Sources:

  1. Natural Language Inference - MultiNLI
  2. Recognizing Textual Entailment - RTE

In general, we regard the NLI problem as one of judging what a typical human reader would conclude to be true upon reading the premise, absent the effects of pragmatics. Inevitably there will be many cases which are not purely, literally implied, but we want to build systems that will be able to draw the same conclusions as humans; this is especially true of commonsense reasoning, which often relies on defeasible inference. We try to exclude particularly questionable cases from the data, and we do not use any sentences that are ungrammatical or semantically incoherent. In general, we use the standards set in the RTE Challenges and follow the guidelines of MultiNLI.

Given two sentences (a premise and hypothesis), we label them with one of two entailment relations:

  • Entailment: the hypothesis states something that is definitely correct about the situation or event in the premise.
  • Not Entailment: the hypothesis states something that might be correct or is definitely incorrect about the situation or event in the premise.

    These definitions are essentially the same as what was provided to the crowdsourced annotators of the MultiNLI dataset. They rely on an assumption that the two sentences describe the same situation. However, a "situation" may involve multiple participants and actions, and the granularity at which we ask the described situations to be the same is somewhat subjective. The remainder of this section describes the decisions we made when constructing the diagnostic dataset to handle these issues.

In [24]:
#English data
en_df['label'].value_counts()
Out[24]:
not_entailment    644
entailment        460
Name: label, dtype: int64
In [25]:
# Examples - Entailment

index = 3
d = en_df[en_df['label']=='entailment'].iloc[index].to_dict()
print('English Entailment:\n')
print(d['sentence1'])
print(d['sentence2'])
print('\nRussian Entailment:\n')
d = rus_df[rus_df['label']=='entailment'].iloc[index].to_dict()
print(d['sentence1'])
print(d['sentence2'])
English Entailment:

Writing Java is not too different from programming with handcuffs.
Writing Java is similar to programming with handcuffs.

Russian Entailment:

Написание кода на Java не слишком отличается от программирования в наручниках.
Написание кода на Java подобно программированию в наручниках.
In [26]:
# Examples - Not Entailment

index = 3
d = en_df[en_df['label']=='not_entailment'].iloc[index].to_dict()
print('English - Not Entailment:\n')
print(d['sentence1'])
print(d['sentence2'])
print('\nRussian - Not Entailment:\n')
d = rus_df[rus_df['label']=='not_entailment'].iloc[index].to_dict()
print(d['sentence1'])
print(d['sentence2'])
English - Not Entailment:

When you've got snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.
When you've got no snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.

Russian - Not Entailment:

Когда у вас есть снег, очень сложно обучиться зимним видам спорта, так что мы искали все способы изучить без снега то, что я мог бы потом повторить на снегу.
Когда у вас нет снега, очень сложно обучиться зимним видам спорта, так что мы искали все способы изучить без снега то, что я мог бы потом повторить на снегу.

Lexical semantics

These phenomena center on aspects of word meaning.

In [27]:
en_df['lexical-semantics'].value_counts()
Out[27]:
                                  736
Lexical entailment                134
Factivity                          64
Quantifiers                        46
Named entities                     36
Symmetry/Collectivity              28
Redundancy                         26
Morphological negation             26
Lexical entailment;Quantifiers      4
Lexical entailment;Factivity        2
Factivity;Quantifiers               2
Name: lexical-semantics, dtype: int64

Lexical Entailment

Entailment can be applied not only on the sentence level, but the word level. For example, we say dog lexically entails animal because anything that is a dog is also an animal, and dog lexically contradicts cat because it is impossible to be both at once. This applies to all kinds of words (nouns, adjectives, verbs, many prepositions, etc.) and the relationship between lexical and sentential entailment has been deeply explored, e.g., in systems of Natural Logic. This connection often hinges on monotonicity in language, so many Lexical Entailment examples will also be tagged with one of the Monotone categories, though we do not do this in every case (see Definite Descriptions and Monotonicity).

Falcon Heavy is the smallest rocket since NASA's Saturn V booster, which was used for the Moon missions in the 1970s.
Falcon Heavy is the largest rocket since NASA's Saturn V booster, which was used for the Moon missions in the 1970s.

Morphological Negation

This is a special case of lexical contradiction where one word is derived from the other: from affordable to unaffordable, agree to disagree, etc. We also include examples like ever and never. We also label these examples with Negation or Double Negation, since they can be viewed as involving a word-level logical negation.

Brexit is a reversible decision, Sir Mike Rake, the chairman of WorldPay and ex-chairman of BT group, said as calls for a second EU referendum were sparked last week.
Brexit is an irreversible decision, Sir Mike Rake, the chairman of WorldPay and ex-chairman of BT group, said as calls for a second EU referendum were sparked last week.

Factivity

Propositions appearing in a sentence may be in any entailment relation with the sentence as a whole, depending on the context in which they appear.

All speech is political speech.
  Joan doubts that all speech is political speech.

In many cases, this is determined by lexical triggers (usually verbs or adverbs) in the sentence. For example,

  • I recognize that X entails X
  • I did not recognize that X entails X
  • I believe that X does not entail X
  • I am refusing to do X contradicts I am doing X
  • I am not refusing to do X does not contradict I am doing X
  • I almost finished X contradicts I finished X
  • I barely finished X entails I finished X

Constructions like the one with recognize are often called factive, since the entailment (of X above, regarded as a presupposition) persists even under negation. Constructions like the one with refusing above are often called implicative, and are sensitive to negation. There are also cases where a sentence (non-)entails the existence of an entity mentioned in it, e.g.,

  • "I have found a unicorn" entails "A unicorn exists"
  • "I am looking for a unicorn" does not necessarily entail "A unicorn exists"

Readings where the entity does not necessarily exist are often called intensional readings, since they seem to deal with the properties denoted by a description (its intension) rather than being reducible to the set of entities that match the description (its extension, which in cases of non-existence will be empty). We place all examples involving these phenomena under the label of Factivity. While it often depends on context to determine whether a nested proposition or existence of an entity is entailed by the overall statement, very often it relies heavily on lexical triggers, so we place the category under Lexical Semantics.

Symmetry/Collectivity

Some propositions denote symmetric relations, while others do not; e.g.,

  • "John married Gary" entails "Gary married John"
  • "John likes Gary" does not entail "Gary likes John"

For symmetric relations, they can often be rephrased by collecting both arguments into the subject:

  • "John met Gary" entails "John and Gary met"

    Whether a relation is symmetric, or admits collecting its arguments into the subject, is often determined by its head word (e.g., like, marry or meet), so we classify it under Lexical Semantics.

    Republican lawmakers will ask President Trump to use a controversial White House framework as the baseline for a coming Senate debate on immigration policy.
    President Trump will ask Republican lawmakers to use a controversial White House framework as the baseline for a coming Senate debate on immigration policy.

Redundancy

If a word can be removed from a sentence without changing its meaning, that means the word's meaning was more or less adequately expressed by the rest of the sentence; so, identifying these cases reflects an understanding of both lexical and sentential semantics.

Tom and Adam were whispering loudly in the theater.
Tom and Adam were whispering in the theater.

Named Entities

Words often name entities that exist out in the world. There are many things we might wish to understand about these names, including their compositional structure (for example, the Baltimore Police is the same as the Police of the City of Baltimore) or their real-world referents and acronym expansions (for example, SNL is Saturday Night Live). This category is closely related to World Knowledge, but focuses on the semantics of names as lexical items rather than background knowledge about their denoted entities.

The sides came to an agreement after their meeting in Europe.
The sides came to an agreement after their meeting in Stockholm.

Quantifiers

Logical quantification in natural language is often expressed through lexical triggers such as every, most, some, and no. While we reserve the categories in Quantification and Monotonicity for entailments involving operations on these quantifiers and their arguments, we choose to regard the interchangeability of quantifiers (e.g., in many cases most entails many) as a question of lexical semantics.

We consider all context words as positive examples and sample many negatives at random from the dictionary.
We consider some context words as positive examples and sample negatives at random from the dictionary.

Logic

Once you understand the structure of a sentence, there is often a baseline set of shallow conclusions you can draw using logical operators. There is a long tradition of modeling natural language semantics using the mathematical tools of logic. Indeed, the development of mathematical logic was initially driven by questions about natural language meaning, from Aristotelian syllogisms to Fregean symbols. The notion of entailment is also borrowed from mathematical logic. So it is no surprise that logic plays an important role in natural language inference.

In [28]:
en_df['logic'].value_counts()
Out[28]:
                                          740
Negation                                   54
Upward monotone                            30
Intervals/Numbers                          30
Temporal                                   28
Double negation                            26
Downward monotone                          26
Conjunction                                24
Disjunction                                22
Non-monotone                               22
Conditionals                               22
Universal                                  14
Existential                                14
Disjunction;Negation                        6
Conjunction;Negation                        6
Intervals/Numbers;Non-monotone              6
Disjunction;Conditionals;Negation           4
Disjunction;Conjunction                     4
Negation;Conditionals                       4
Temporal;Conjunction                        2
Downward monotone;Conditionals              2
Universal;Negation                          2
Downward monotone;Existential;Negation      2
Double negation;Negation                    2
Conjunction;Upward monotone                 2
Existential;Negation                        2
Disjunction;Non-monotone                    2
Universal;Conjunction                       2
Existential;Upward monotone                 2
Temporal;Intervals/Numbers                  2
Name: logic, dtype: int64

Propositional Structure

Negation, Double Negation, Conjunction, Disjunction, Conditionals

All of the basic operations of propositional logic appear in natural language, and we tag them where they are relevant to our examples:

Negation: The cat sat on the mat contradicts The cat did not sit on the mat.

When you have got snow, it is really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.
When you have got no snow, it is really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.

Double negation: The market is not impossible to navigate entails The market is possible to navigate.

The market is about to get harder, but possible to navigate.
The market is about to get harder, but not impossible to navigate.

Conjunction: Temperature and snow consistency must be just right entails Temperature must be just right.

The patient bears some responsibility for successful care.
Both doctor and patient bear some responsibility for successful care.

Disjunction: Life is either a daring adventure or nothing at all does not entail, but is entailed by, Life is a daring adventure.

He has a blind trust.
Either he has a blind trust or he has a conflict of interest.

Conditionals: If both apply, they are essentially impossible does not entail They are essentially impossible. Conditionals are a little bit more complicated because their use in language does not always mirror their meaning in logic. For example, they may be used at a higher level than the at-issue assertion: If you think about it, it is the perfect reverse psychology tactic entails It is the perfect reverse psychology tactic.

Pedro does not have a donkey.
If Pedro has a donkey, then he beats it.

Quantifications

Universal, Existential

Quantifiers are often triggered by words such as all, some, many, and no. There is a rich body of work modeling their meaning in mathematical logic with generalized quantifiers. In these two categories, we focus on straightforward inferences from the natural language analogs of universal and existential quantification:

Universal: All parakeets have two wings entails, but is not entailed by, My parakeet has two wings.

No one has a set of principles to live by.
Everyone has a set of principles to live by.

Existential: Some parakeets have two wings does not entail, but is entailed by, My parakeet has two wings.

No one knows how turtles reproduce.
Susan knows how turtles reproduce.

Monotonicity

Upward Monotone, Downward Monotone, Non-Monotone

Monotonicity is a property of argument positions in certain logical systems. In general, it gives a way of deriving entailment relations between expressions that differ on only one subexpression. In language, it can explain how some entailments propagate through logical operators and quantifiers. For example, note that pet entails pet squirrel, which further entails happy pet squirrel. We can demonstrate how the quantifiers a, no and exactly one differ with respect to monotonicity:

  • "I have a pet squirrel" entails "I have a pet", but not "I have a happy pet squirrel".
  • "I have no pet squirrels" does not entail "I have no pets", but does entail "I have no happy pet squirrels".
  • "I have exactly one pet squirrel" entails neither "I have exactly one pet" nor "I have exactly one happy pet squirrel".

In all of these examples, the pet squirrel appears in what we call the restrictor position of the quantifier. We say:

  • a is upward monotone in its restrictor: an entailment in the restrictor yields an entailment of the whole statement.
  • no is downward monotone in its restrictor: an entailment in the restrictor yields an entailment of the whole statement in the opposite direction.
  • exactly one is non-monotone in its restrictor: entailments in the restrictor do not yield entailments of the whole statement.

In this way, entailments between sentences that are built off of entailments of sub-phrases almost always rely on monotonicity judgments; see, for example, Lexical Entailment. However, because this is such a general class of sentence pairs, to keep the Logic category meaningful we do not always tag these examples with monotonicity; see Definite Descriptions and Monotonicity for details. To draw an analogy, these types of monotonicity are closely related to covariance, contravariance, and invariance of type arguments in programming languages with subtyping.
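To make that analogy concrete, here is a purely illustrative Python typing sketch (it is not part of the diagnostic data): the subtype relation Squirrel <: Animal plays the role of the lexical entailment from pet squirrel to pet, and the variance of a type parameter plays the role of the monotonicity of an argument position.

from typing import Generic, TypeVar

class Animal: ...
class Squirrel(Animal): ...  # Squirrel <: Animal, like "pet squirrel" entails "pet"

T_co = TypeVar("T_co", covariant=True)              # analogous to an upward-monotone position
T_contra = TypeVar("T_contra", contravariant=True)  # analogous to a downward-monotone position
T_inv = TypeVar("T_inv")                            # invariant, analogous to a non-monotone position

class Producer(Generic[T_co]): ...      # Producer[Squirrel] is usable where Producer[Animal] is expected
class Consumer(Generic[T_contra]): ...  # Consumer[Animal] is usable where Consumer[Squirrel] is expected
class Box(Generic[T_inv]): ...          # Box[Squirrel] and Box[Animal] are unrelated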

Richer Logical Structure

Intervals/Numbers, Temporal

There are some higher-level facets of reasoning that have been traditionally modeled using logic; these include actual mathematical reasoning (entailments based off of numbers) and temporal reasoning (which is often modeled as reasoning about a mathematical timeline).

Intervals/Numbers: I have had more than 2 drinks tonight entails I have had more than 1 drink tonight.

I failed my resolutions in 1995.
I have failed my resolutions every year since 1997, and it is now 2008.

Temporal: Mary left before John entered entails John entered after Mary left.

John entered after Mary left.
Mary left before John entered.

Predicate-argument structure

An important component of understanding the meaning of a sentence is understanding how its parts are composed together into a whole. In this category, we address issues across that spectrum, from syntactic ambiguity to semantic roles and coreference.

In [29]:
en_df['predicate-argument-structure'].value_counts()
Out[29]:
                                              680
Prepositional phrases                          56
Core args                                      48
Intersectivity                                 44
Anaphora/Coreference                           42
Coordination scope                             34
Active/Passive                                 32
Ellipsis/Implicits                             28
Nominalization                                 26
Relative clauses                               24
Datives                                        20
Genitives/Partitives                           18
Restrictivity                                  18
Coordination scope;Prepositional phrases        6
Core args;Anaphora/Coreference                  4
Relative clauses;Restrictivity                  4
Anaphora/Coreference;Prepositional phrases      4
Ellipsis/Implicits;Anaphora/Coreference         4
Restrictivity;Anaphora/Coreference              2
Nominalization;Genitives/Partitives             2
Relative clauses;Anaphora/Coreference           2
Restrictivity;Relative clauses                  2
Active/Passive;Prepositional phrases            2
Intersectivity;Ellipsis/Implicits               2
Name: predicate-argument-structure, dtype: int64

Syntactic Ambiguity: Relative Clauses, Coordination Scope

These two categories deal purely with resolving syntactic ambiguity. Relative clauses and coordination scope are both sources of a great amount of ambiguity in English.

Mao was chairman of the Communist Party from before its accession to power in 1949 until his death in 1976.
The move marks an end to a system put in place by Deng Xiaoping in the 1980s to prevent the rise of another Mao, who was chairman of the Communist Party from before its accession to power in 1949 until his death in 1976.

Prepositional phrases

Prepositional phrase attachment is a particularly difficult problem that syntactic parsers in NLP systems continue to struggle with. We view it as a problem both of syntax and semantics, since prepositional phrases can express a wide variety of semantic roles and often semantically apply beyond their direct syntactic attachment.

On Sunday, Jane had a party.
Jane had a party on Sunday.

Core Arguments

Verbs select for particular arguments, especially as their subject and object, which might be interchangeable depending on the context or the surface form. One example is the ergative alternation:

  • "Jake broke the vase" entails "the vase broke".
  • "Jake broke the vase" does not entail "Jake broke".

Other rearrangements of core arguments, such as those seen in Symmetry/Collectivity, also fall under the Core Arguments label.

Alternations: Active/Passive, Genitives/Partitives, Nominalization, Datives

All four of these categories correspond to syntactic alternations that are known to follow specific patterns in English:

  • Active/Passive: I saw him is equivalent to He was seen by me and entails He was seen.
  • Genitives/Partitives: the elephant's foot is the same thing as the foot of the elephant.
  • Nominalization: I caused him to submit his resignation entails I caused the submission of his resignation.
  • Datives: I baked him a cake entails I baked a cake for him and I baked a cake but not I baked him.

Ellipsis/Implicits

Often, the argument of a verb or other predicate is omitted (elided) in the text, with the reader filling in the gap. We can construct entailment examples by explicitly filling in the gap with the correct or incorrect referents. For example:

  • Premise: Putin is so entrenched within Russia’s ruling system that many of its members can imagine no other leader.
  • Entails: Putin is so entrenched within Russia’s ruling system that many of its members can imagine no other leader than Putin.
  • Contradicts: Putin is so entrenched within Russia's ruling system that many of its members can imagine no other leader than themselves.

This is often regarded as a special case of anaphora, but we decided to split out these cases from explicit anaphora, which is often also regarded as a case of coreference (and attempted to some degree in modern coreference resolution systems).

Anaphora/Coreference

Coreference refers to when multiple expressions refer to the same entity or event. It is closely related to Anaphora, where the meaning of an expression depends on another (antecedent) expression in context. These phenomena have significant overlap, for example, with pronouns (she, we, it), which are anaphors that are co-referent with their antecedents. However, they also may occur independently, for example, coreference between two definite noun phrases (e.g., Theresa May and the British Prime Minister) that refer to the same entity, or anaphora from a word like other which requires an antecedent to distinguish something from. In this category we only include cases where there is an explicit phrase (anaphoric or not) that is co-referent with an antecedent or other phrase. We construct examples for these in much the same way as for Ellipsis/Implicits.

George fell into the water.
George went to the lake to catch a fish, but he fell into the water.

Intersectivity

Many modifiers, especially adjectives, allow non-intersective uses, which affect their entailment behavior. For example:

  • Intersective: He is a violinist and an old surgeon entails He is an old violinist and He is a surgeon
  • Non-intersective: He is a violinist and a skilled surgeon does not entail He is a skilled violinist
  • Non-intersective: He is a fake surgeon does not entail He is a surgeon

    Generally, an intersective use of a modifier, like old in old men, is one which may be interpreted as referring to the set of entities with both properties (they are old and they are men). Linguists often formalize this using set intersection, hence the name. It is related to Factivity; for example fake may be regarded as a counter-implicative modifier, and these examples will be labeled as such. However, we choose to categorize intersectivity under predicate-argument structure rather than lexical semantics, because generally the same word will admit both intersective and non-intersective uses, so it may be regarded as an ambiguity of argument structure.

Restrictivity

Restrictivity is most often used to refer to a property of uses of noun modifiers; in particular, a restrictive use of a modifier is one that serves to identify the entity or entities being described, whereas a non-restrictive use adds extra details to the identified entity. The distinction can often be highlighted by entailments:

  • Restrictive: I finished all of my homework due today does not entail I finished all of my homework
  • Non-restrictive: I got rid of all those pesky bedbugs entails I got rid of all those bedbugs.

Modifiers that are commonly used non-restrictively are appositives, relative clauses starting with which or who (although these can be restrictive, despite what your English teacher might tell you), and expletives (e.g. pesky). However, non-restrictive uses can appear in many forms.

Ambiguity in restrictivity is often employed in certain kinds of jokes (warning: language).

Knowledge

Strictly speaking, world knowledge and common sense are required on every level of language understanding, for disambiguating word senses, syntactic structures, anaphora, and more. So our entire suite (and any test of entailment) does test these features to some degree. However, in these categories, we gather examples where the entailment rests not only on correct disambiguation of the sentences, but also application of extra knowledge, whether it is concrete knowledge about world affairs or more common-sense knowledge about word meanings or social or physical dynamics.

In [30]:
en_df['knowledge'].value_counts()
Out[30]:
                   820
Common sense       150
World knowledge    134
Name: knowledge, dtype: int64

World Knowledge

In this category we focus on knowledge that can clearly be expressed as facts, as well as broader and less common geographical, legal, political, technical, or cultural knowledge. Examples:

  • "This is the most oniony article I have seen on the entire internet" entails "This article reads like satire".
  • "The reaction was strongly exothermic" entails "The reaction media got very hot".
  • "There are amazing hikes around Mt. Fuji" entails "There are amazing hikes in Japan" but not "There are amazing hikes in Nepal".

Common Sense

In this category we focus on knowledge that is more difficult to express as facts and that we expect to be possessed by most people independent of cultural or educational background. This includes a basic understanding of physical and social dynamics as well as lexical meaning (beyond simple lexical entailment or logical relations). Examples:

  • "The announcement of Tillerson's departure sent shock waves across the globe" contradicts "People across the globe were prepared for Tillerson's departure".
  • "Marc Sims has been seeing his barber once a week, for several years" entails "Marc Sims has been getting his hair cut once a week, for several years".
  • "Hummingbirds are really attracted to bright orange and red (hence why the feeders are usually these colours)" entails "The feeders are usually coloured so as to attract hummingbirds".

Basic Model Performance, by category

In [31]:
print('Russian accuracy: ', round(rus_df.equal.sum()/rus_df.shape[0],3))
print('English accuracy: ', round(en_df.equal.sum()/en_df.shape[0],3))
print('\nEntailment percent (label)', round(rus_df.label.value_counts()['entailment']/rus_df.shape[0],3))
print('Entailment percent (Russian)', round(rus_df.prediction.value_counts()['entailment']/rus_df.shape[0],3))
print('Entailment percent (English)', round(en_df.prediction.value_counts()['entailment']/en_df.shape[0],3))
Russian accuracy:  0.589
English accuracy:  0.539

Entailment percent (label) 0.416
Entailment percent (Russian) 0.381
Entailment percent (English) 0.784

The accuracy of the Russian model is higher than that of the English one. In addition, the label distribution of the Russian model (38.1% Entailment) is significantly closer to the gold distribution (41.6% Entailment) than that of the English model (78.4%).

The English model has a significant bias towards Entailment: it predicts this label almost twice as often as it should. The Russian model does not show this bias; on the contrary, it slightly under-predicts Entailment.
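One quick way to see this bias directly (not part of the original notebook) is to cross-tabulate gold labels against predictions:

# rows: gold label, columns: model prediction
print(pd.crosstab(en_df['label'], en_df['prediction']))
print(pd.crosstab(rus_df['label'], rus_df['prediction']))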

Conclusion: the label distribution of the Russian model is much closer to the gold distribution, and accordingly its accuracy is higher.

Next, the Russian and English models are compared with each other category by category: logic, lexical-semantics, predicate-argument-structure, and knowledge. For each category, only the groups with more than 10 examples are kept in the analysis, since statistics computed over very small groups are not indicative and can even be misleading.
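As an aside, here is a minimal sketch of applying this threshold directly (the helper function below keeps a fixed number of the most frequent groups instead):

# keep only the 'logic' groups with more than 10 examples
counts = en_df['logic'].value_counts()
frequent_groups = counts[counts > 10].index
en_logic = en_df[en_df['logic'].isin(frequent_groups)]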

Accuracy by category

In [32]:
def group_dataframe_by_label(df, label):
    # keep only the most frequent groups of the given tag column
    top_groups = df[label].value_counts().head(13).index
    df_short = df[df[label].isin(top_groups)].copy()
    # encode labels and predictions as 0/1 so that a groupby-sum gives counts
    df_short['equal'] = df_short['equal'].map({False: 0, True: 1})
    df_short['label'] = df_short['label'].map({'not_entailment': 0, 'entailment': 1})
    df_short['prediction'] = df_short['prediction'].map({'not_entailment': 0, 'entailment': 1})
    df_short['ss'] = 1  # per-row counter, summed into the group size
    res_df = df_short[[label, 'equal', 'label', 'prediction', 'ss']].groupby(label).sum()
    # turn sums into shares: accuracy, gold entailment rate, predicted entailment rate
    for col in ['equal', 'label', 'prediction']:
        res_df[col] = res_df[col] / res_df.ss
    return res_df.sort_values('equal', ascending=False)

Logic

English Diagnostics

In [33]:
res_df = group_dataframe_by_label(en_df, 'logic')
sns.heatmap(res_df[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5248d4f898>

Russian Diagnostics

In [34]:
sns.heatmap(group_dataframe_by_label(rus_df, 'logic')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5248da9be0>

Logic

The Russian BERT model did much better on Negation, Disjunction and Non-monotone, but failed almost completely on Double negation. The latter is possibly due to the peculiarities of how double negation works in Russian.

Lexical-semantics

English Diagnostics

In [35]:
sns.heatmap(group_dataframe_by_label(en_df, 'lexical-semantics')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5248b84208>

Russian Diagnostics

In [36]:
sns.heatmap(group_dataframe_by_label(rus_df, 'lexical-semantics')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5248a9f128>

Lexical-semantics

The Russian BERT model performs much better than the English one on Quantifiers and Factivity, but could not cope with Morphological negation and Named entities.

Predicate-argument structure

English Diagnostics

In [37]:
sns.heatmap(group_dataframe_by_label(en_df, 'predicate-argument-structure')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f52489e7278>

Russian Diagnostics

In [38]:
sns.heatmap(group_dataframe_by_label(rus_df, 'predicate-argument-structure')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f52488f0a90>

Predicate-argument structure

Most notably, the Russian BERT fails completely on Restrictivity, while the English model shows a good result.

Knowledge

English Diagnostics

In [21]:
sns.heatmap(group_dataframe_by_label(en_df, 'knowledge')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f05490ac048>

Russian Diagnostics

In [20]:
sns.heatmap(group_dataframe_by_label(rus_df, 'knowledge')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0549186748>