The SuperGLUE project begins with a diagnostic dataset representing various linguistic phenomena - the Diagnostic Dataset. This dataset evaluates sentence understanding through the natural language inference (NLI) problem.
The NLI task is well suited for model diagnostics because it can encompass a wide range of skills related to general language understanding, from resolving syntactic ambiguity to high-level reasoning, while supporting a straightforward binary evaluation.
In this notebook, we examine the components of the diagnostic dataset and explore the difference in baseline model performance between the Russian (translated) and the original English dataset.
The dataset is fully compatible with original English Diagnostic dataset: https://super.gluebenchmark.com/diagnostics
The data consists of several hundred sentence pairs labeled with their entailment relations (entailment or not_entailment) in both directions and tagged with a set of linguistic phenomena involved in justifying the entailment labels.
It was constructed manually by the authors of GLUE and draws text from several different sources, including news, academic and encyclopedic text, and social media. The datasets presented here are a direct analogue of the original SuperGLUE diagnostic data.
The sentence pairs were crafted so that each sentence in a pair is very similar to the other, to make the problem harder for systems that rely on simple lexical cues and statistics.
Linguistic phenomena are tagged with both coarse- and fine-grained category labels. The coarse-grained categories are Lexical Semantics, Predicate-Argument Structure, Logic, and Knowledge and Common Sense. Each of these has several fine-grained subcategories for specific linguistic phenomena of that kind. See the Linguistic Categorization section for details on the meanings of all of the categories and how they are annotated.
Each example in the diagnostic dataset has the same structure:
{
'sentence1': "The cat sat on the mat.",
'sentence2': "The cat did not sit on the mat.",
'label': 'not_entailment',
'knowledge': '',
'lexical-semantics': '',
'logic': 'Negation',
'predicate-argument-structure': ''
}
where sentence1 is the premise, sentence2 is the hypothesis, label is the entailment relation (entailment or not_entailment), and the four remaining fields hold the fine-grained tags (if any) for each coarse-grained category.
See below to explore the full tag variety and linguistic specification.
In this example notebook the datasets are also enriched with two additional tags: prediction (the label predicted by the baseline model) and equal (whether the prediction matches the gold label).
These tags are added for the convenience of further analysis - feel free to reuse them when evaluating your own model [Link to the code]
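For reference, here is a minimal sketch of how these two columns can be reproduced when evaluating your own model; my_model_predict below is a hypothetical stand-in for any classifier that maps a premise-hypothesis pair to 'entailment' or 'not_entailment':
import pandas as pd

def add_prediction_columns(df, my_model_predict):
    # my_model_predict(premise, hypothesis) -> 'entailment' | 'not_entailment' (hypothetical helper)
    df = df.copy()
    df['prediction'] = [my_model_predict(s1, s2)
                        for s1, s2 in zip(df['sentence1'], df['sentence2'])]
    df['equal'] = df['prediction'] == df['label']  # True where the prediction matches the gold label
    return df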
imports
import pandas as pd
import warnings
warnings.simplefilter('ignore')
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
# svg figures look sharper
%config InlineBackend.figure_format = 'svg'
# increase the default figure size
from pylab import rcParams
rcParams['figure.figsize'] = 8, 5
English Model - bert-large-cased, trained on RTE.
Russian Model - DeepPavlov/rubert-base-cased-conversational, trained on the Russian RTE (a short version of 2,000 examples).
rus_df = pd.read_csv('AX-b-rus-pred.csv', index_col = 'idx').fillna('')
en_df = pd.read_csv('AX-b-en-pred.csv', index_col = 'idx').fillna('')
print('Russian accuracy: ', round(rus_df.equal.sum()/rus_df.shape[0],3))
print('English accuracy: ', round(en_df.equal.sum()/en_df.shape[0],3))
print('\nEntailment percent (label)', round(rus_df.label.value_counts()['entailment']/rus_df.shape[0],3))
print('Entailment percent (Russian)', round(rus_df.prediction.value_counts()['entailment']/rus_df.shape[0],3))
print('Entailment percent (English)', round(en_df.prediction.value_counts()['entailment']/en_df.shape[0],3))
en_df.head()
rus_df.head()
Natural Language Inference
Sources:
In general, we regard the NLI problem as one of judging what a typical human reader would conclude to be true upon reading the premise, absent the effects of pragmatics. Inevitably there will be many cases which are not purely, literally implied, but we want to build systems that will be able to draw the same conclusions as humans. Especially in the case of commonsense reasoning, which often relies on defeasible inference, this will be the case. We try to exclude particularly questionable cases from the data, and we do not use any sentences that are ungrammatical or semantically incoherent. In general, we use the standards set in the RTE Challenges and follow the guidelines of MultiNLI.
Given two sentences (a premise and a hypothesis), we label them with one of two entailment relations:
Entailment: the hypothesis states something that is definitely correct about the situation or event in the premise.
Not Entailment: the hypothesis states something that might be correct or is definitely incorrect about the situation or event in the premise.
These definitions are essentially the same as what was provided to the crowdsourced annotators of the MultiNLI dataset. They rely on an assumption that the two sentences describe the same situation. However, a "situation" may involve multiple participants and actions, and the granularity at which we ask the described situations to be the same is somewhat subjective. The remainder of this section describes the decisions we made when constructing the diagnostic dataset to decide these issues.
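For reference, this is roughly how a single premise-hypothesis pair is scored by a BERT-style binary classifier. A minimal sketch with the transformers library; the checkpoint path is a placeholder for a model fine-tuned on RTE, not the exact code used to produce the predictions in this notebook:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'path/to/rte-finetuned-bert'  # placeholder: substitute your own fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

premise = 'The cat sat on the mat.'
hypothesis = 'The cat did not sit on the mat.'
# Sentence-pair encoding: [CLS] premise [SEP] hypothesis [SEP]
inputs = tokenizer(premise, hypothesis, return_tensors='pt', truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = int(logits.argmax(dim=-1))
# The mapping of class ids to 'entailment' / 'not_entailment' depends on the checkpoint's config.
print(model.config.id2label[pred_id])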
#English data
en_df['label'].value_counts()
# Examples - Entailment
index = 3
d = en_df[en_df['label']=='entailment'].iloc[index].to_dict()
print('English Entailment:\n')
print(d['sentence1'])
print(d['sentence2'])
print('\nRussian Entailment:\n')
d = rus_df[rus_df['label']=='entailment'].iloc[index].to_dict()
print(d['sentence1'])
print(d['sentence2'])
# Examples - Not Entailment
index = 3
d = en_df[en_df['label']=='not_entailment'].iloc[index].to_dict()
print('English - Not Entailment:\n')
print(d['sentence1'])
print(d['sentence2'])
print('\nRussian - Not Entailment:\n')
d = rus_df[rus_df['label']=='not_entailment'].iloc[index].to_dict()
print(d['sentence1'])
print(d['sentence2'])
These phenomena center on aspects of word meaning.
en_df['lexical-semantics'].value_counts()
Entailment can be applied not only at the sentence level, but also at the word level. For example, we say dog lexically entails animal because anything that is a dog is also an animal, and dog lexically contradicts cat because it is impossible to be both at once. This applies to all kinds of words (nouns, adjectives, verbs, many prepositions, etc.), and the relationship between lexical and sentential entailment has been deeply explored, e.g., in systems of Natural Logic. This connection often hinges on monotonicity in language, so many Lexical Entailment examples will also be tagged with one of the Monotone categories, though we do not do this in every case (see Definite Descriptions and Monotonicity).
Falcon Heavy is the smallest rocket since NASA's Saturn V booster, which was used for the Moon missions in the 1970s.
Falcon Heavy is the largest rocket since NASA's Saturn V booster, which was used for the Moon missions in the 1970s.
This is a special case of lexical contradiction where one word is derived from the other: from affordable to unaffordable, agree to disagree, etc. We also include examples like ever and never. We also label these examples with Negation or Double Negation, since they can be viewed as involving a word-level logical negation.
Brexit is a reversible decision, Sir Mike Rake, the chairman of WorldPay and ex-chairman of BT group, said as calls for a second EU referendum were sparked last week.
Brexit is an irreversible decision, Sir Mike Rake, the chairman of WorldPay and ex-chairman of BT group, said as calls for a second EU referendum were sparked last week.
Propositions appearing in a sentence may be in any entailment relation with the sentence as a whole, depending on the context in which they appear.
All speech is political speech.
Joan doubts that all speech is political speech.
In many cases, this is determined by lexical triggers (usually verbs or adverbs) in the sentence. For example,
Some propositions denote symmetric relations, while others do not; e.g.,
For symmetric relations, they can often be rephrased by collecting both arguments into the subject:
"John met Gary" entails "John and Gary met"
Whether a relation is symmetric, or admits collecting its arguments into the subject, is often determined by its head word (e.g., like, marry or meet), so we classify it under Lexical Semantics.
Republican lawmakers will ask President Trump to use a controversial White House framework as the baseline for a coming Senate debate on immigration policy.
President Trump will ask Republican lawmakers to use a controversial White House framework as the baseline for a coming Senate debate on immigration policy.
If a word can be removed from a sentence without changing its meaning, that means the word's meaning was more or less adequately expressed by the rest of the sentence; so identifying these cases reflects an understanding of both lexical and sentential semantics.
Tom and Adam were whispering loudly in the theater.
Tom and Adam were whispering in the theater.
Words often name entities that exist out in the world. There are many different things we might wish to understand about these names, including their compositional structure (for example, the Baltimore Police is the same as the Police of the City of Baltimore) or their real-world referents and acronym expansions (for example, SNL is Saturday Night Live). This category is closely related to World Knowledge, but focuses on the semantics of names as lexical items rather than background knowledge about their denoted entities.
The sides came to an agreement after their meeting in Europe.
The sides came to an agreement after their meeting in Stockholm.
Logical quantification in natural language is often expressed through lexical triggers such as every, most, some, and no. While we reserve the categories in Quantification and Monotonicity for entailments involving operations on these quantifiers and their arguments, we choose to regard the interchangeability of quantifiers (e.g., in many cases most entails many) as a question of lexical semantics.
We consider all context words as positive examples and sample many negatives at random from the dictionary.
We consider some context words as positive examples and sample negatives at random from the dictionary.
Once you understand the structure of a sentence, there is often a baseline set of shallow conclusions you can draw using logical operators. There is a long tradition of modeling natural language semantics using the mathematical tools of logic. Indeed, the development of mathematical logic was initially driven by questions about natural language meaning, from Aristotelian syllogisms to Fregean symbols. The notion of entailment is also borrowed from mathematical logic. So it is no surprise that logic plays an important role in natural language inference.
en_df['logic'].value_counts()
Negation, Double Negation, Conjunction, Disjunction, Conditionals
All of the basic operations of propositional logic appear in natural language, and we tag them where they are relevant to our examples:
Negation: The cat sat on the mat contradicts The cat did not sit on the mat.
When you have got snow, it is really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.
When you have got no snow, it is really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.
Double negation: The market is not impossible to navigate entails The market is possible to navigate.
The market is about to get harder, but possible to navigate.
The market is about to get harder, but not impossible to navigate.
Conjunction: Temperature and snow consistency must be just right entails Temperature must be just right.
The patient bears some responsibility for successful care.
Both doctor and patient bear some responsibility for successful care.
Disjunction: Life is either a daring adventure or nothing at all does not entail, but is entailed by, Life is a daring adventure.
He has a blind trust.
Either he has a blind trust or he has a conflict of interest.
Conditionals: If both apply, they are essentially impossible does not entail They are essentially impossible. Conditionals are a little bit more complicated because their use in language does not always mirror their meaning in logic. For example, they may be used at a higher level than the at-issue assertion: If you think about it, it is the perfect reverse psychology tactic entails It is the perfect reverse psychology tactic.
Pedro does not have a donkey.
If Pedro has a donkey, then he beats it.
Universal, Existential
Quantifiers are often triggered by words such as all, some, many, and no. There is a rich body of work modeling their meaning in mathematical logic with generalized quantifiers. In these two categories, we focus on straightforward inferences from the natural language analogs of universal and existential quantification:
Universal: All parakeets have two wings entails, but is not entailed by, My parakeet has two wings.
No one has a set of principles to live by.
Everyone has a set of principles to live by.
Existential: Some parakeets have two wings does not entail, but is entailed by, My parakeet has two wings.
No one knows how turtles reproduce.
Susan knows how turtles reproduce.
Upward Monotone, Downward Monotone, Non-Monotone
Monotonicity is a property of argument positions in certain logical systems. In general, it gives a way of deriving entailment relations between expressions that differ on only one subexpression. In language, it can explain how some entailments propagate through logical operators and quantifiers. For example, note that pet squirrel entails pet, and happy pet squirrel in turn entails pet squirrel. We can demonstrate how the quantifiers a, no and exactly one differ with respect to monotonicity:
I have a pet squirrel entails I have a pet.
No one has a pet entails No one has a pet squirrel.
Exactly one person has a pet squirrel neither entails nor is entailed by Exactly one person has a pet.
In all of these examples, the pet squirrel appears in what we call the restrictor position of the quantifier. We say:
a is upward monotone in its restrictor: an entailment in the restrictor yields an entailment of the whole statement.
no is downward monotone in its restrictor: an entailment in the restrictor yields an entailment of the whole statement in the opposite direction.
exactly one is non-monotone in its restrictor: entailments in the restrictor do not yield entailments of the whole statement.
In this way, entailments between sentences that are built off of entailments of sub-phrases almost always rely on monotonicity judgments; see, for example, Lexical Entailment. However, because this is such a general class of sentence pairs, to keep the Logic category meaningful we do not always tag these examples with monotonicity; see Definite Descriptions and Monotonicity for details. To draw an analogy, these types of monotonicity are closely related to covariance, contravariance, and invariance of type arguments in programming languages with subtyping.
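To make the programming-language analogy concrete, here is a small illustrative Python sketch; the Animal/Dog hierarchy is a toy example, and the subtyping judgments are made by a static type checker such as mypy rather than at runtime:
from typing import Callable, List, Sequence

class Animal: ...
class Dog(Animal): ...  # Dog is a subtype of Animal, much like pet squirrel entails pet

def count(xs: Sequence[Animal]) -> int:
    # Sequence is covariant ('upward monotone'): a Sequence[Dog] is accepted here
    return len(xs)

def adopt_all(xs: List[Animal]) -> None:
    # List is invariant ('non-monotone'): a List[Dog] would be rejected by mypy
    xs.append(Animal())

def handle_dog(handler: Callable[[Dog], None]) -> None:
    # Callable is contravariant in its argument ('downward monotone'):
    # a Callable[[Animal], None] is accepted here
    handler(Dog())

def pet(a: Animal) -> None:
    print('petting', a)

dogs: List[Dog] = [Dog()]
count(dogs)        # OK under covariance
handle_dog(pet)    # OK under contravariance
# adopt_all(dogs)  # mypy error under invariance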
Intervals/Numbers, Temporal
There are some higher-level facets of reasoning that have been traditionally modeled using logic; these include actual mathematical reasoning (entailments based off of numbers) and temporal reasoning (which is often modeled as reasoning about a mathematical timeline).
Intervals/Numbers: I have had more than 2 drinks tonight entails I have had more than 1 drink tonight.
I failed my resolutions in 1995.
I have failed my resolutions every year since 1997, and it is now 2008.
Temporal: Mary left before John entered entails John entered after Mary left.
John entered after Mary left.
Mary left before John entered.
An important component of understanding the meaning of a sentence is understanding how its parts are composed together into a whole. In this category, we address issues across that spectrum, from syntactic ambiguity to semantic roles and coreference.
en_df['predicate-argument-structure'].value_counts()
These two categories deal purely with resolving syntactic ambiguity. Relative clauses and coordination scope are both sources of a great amount of ambiguity in English.
Mao was chairman of the Communist Party from before its accession to power in 1949 until his death in 1976.
The move marks an end to a system put in place by Deng Xiaoping in the 1980s to prevent the rise of another Mao, who was chairman of the Communist Party from before its accession to power in 1949 until his death in 1976.
Prepositional phrase attachment is a particularly difficult problem that syntactic parsers in NLP systems continue to struggle with. We view it as a problem both of syntax and semantics, since prepositional phrases can express a wide variety of semantic roles and often semantically apply beyond their direct syntactic attachment.
On Sunday, Jane had a party.
Jane had a party on Sunday.
Verbs select for particular arguments, especially as their subject and object, which might be interchangeable depending on the context or the surface form. One example is the ergative alternation, in which the object of a transitive verb can surface as the subject of the intransitive one: Jake broke the vase entails The vase broke.
Other rearrangements of core arguments, such as those seen in Symmetry/Collectivity, also fall under the Core Arguments label.
Alternations: Active/Passive, Genitives/Partitives, Nominalization, Datives
All four of these categories correspond to syntactic alternations that are known to follow specific patterns in English:
Often, the argument of a verb or other predicate is omitted (elided) in the text, with the reader filling in the gap. We can construct entailment examples by explicitly filling in the gap with the correct or incorrect referents. For example:
Coreference refers to when multiple expressions refer to the same entity or event. It is closely related to Anaphora, where the meaning of an expression depends on another (antecedent) expression in context. These phenomena have significant overlap, for example, with pronouns (she, we, it), which are anaphors that are co-referent with their antecedents. However, they also may occur independently, for example, coreference between two definite noun phrases (e.g., Theresa May and the British Prime Minister) that refer to the same entity, or anaphora from a word like other which requires an antecedent to distinguish something from. In this category we only include cases where there is an explicit phrase (anaphoric or not) that is co-referent with an antecedent or other phrase. We construct examples for these in much the same way as for Ellipsis/Implicits.
George fell into the water.
George went to the lake to catch a fish, but he fell into the water.
Many modifiers, especially adjectives, allow non-intersective uses, which affect their entailment behavior. For example:
Non-intersective: He is a fake surgeon does not entail He is a surgeon
Generally, an intersective use of a modifier, like old in old men, is one which may be interpreted as referring to the set of entities with both properties (they are old and they are men). Linguists often formalize this using set intersection, hence the name. It is related to Factivity; for example fake may be regarded as a counter-implicative modifier, and these examples will be labeled as such. However, we choose to categorize intersectivity under predicate-argument structure rather than lexical semantics, because generally the same word will admit both intersective and non-intersective uses, so it may be regarded as an ambiguity of argument structure.
Restrictivity is most often used to refer to a property of uses of noun modifiers; in particular, a restrictive use of a modifier is one that serves to identify the entity or entities being described, whereas a non-restrictive use adds extra details to the identified entity. The distinction can often be highlighted by entailments:
Modifiers that are commonly used non-restrictively are appositives, relative clauses starting with which or who (although these can be restrictive, despite what your English teacher might tell you), and expletives (e.g. pesky). However, non-restrictive uses can appear in many forms.
Ambiguity in restrictivity is often employed in certain kinds of jokes (warning: language).
Strictly speaking, world knowledge and common sense are required on every level of language understanding, for disambiguating word senses, syntactic structures, anaphora, and more. So our entire suite (and any test of entailment) does test these features to some degree. However, in these categories, we gather examples where the entailment rests not only on correct disambiguation of the sentences, but also application of extra knowledge, whether it is concrete knowledge about world affairs or more common-sense knowledge about word meanings or social or physical dynamics.
en_df['knowledge'].value_counts()
In this category we focus on knowledge that can clearly be expressed as facts, as well as broader and less common geographical, legal, political, technical, or cultural knowledge. Examples:
In this category we focus on knowledge that is more difficult to express as facts and that we expect to be possessed by most people independent of cultural or educational background. This includes a basic understanding of physical and social dynamics as well as lexical meaning (beyond simple lexical entailment or logical relations). Examples:
print('Russian accuracy: ', round(rus_df.equal.sum()/rus_df.shape[0],3))
print('English accuracy: ', round(en_df.equal.sum()/en_df.shape[0],3))
print('\nEntailment percent (label)', round(rus_df.label.value_counts()['entailment']/rus_df.shape[0],3))
print('Entailment percent (Russian)', round(rus_df.prediction.value_counts()['entailment']/rus_df.shape[0],3))
print('Entailment percent (English)', round(en_df.prediction.value_counts()['entailment']/en_df.shape[0],3))
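The label distributions behind these numbers can also be viewed as confusion matrices of gold labels versus predictions, which makes the entailment bias discussed below easy to see; a quick sketch with pandas:
# Confusion matrices (rows: gold label, columns: predicted label), normalized per gold label
print(pd.crosstab(en_df.label, en_df.prediction, normalize='index').round(3))
print(pd.crosstab(rus_df.label, rus_df.prediction, normalize='index').round(3))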
The accuracy of the Russian model is higher than that of the English one. In addition, the label distribution of the Russian model (38.1% Entailment) is significantly closer to the gold distribution (41.6% Entailment) than that of the English model (78.4%).
The English model has a significant bias towards Entailment: it predicts it almost twice as often as it should. The Russian model shows no such bias; on the contrary, it predicts Entailment slightly less often than the gold labels.
Conclusion: the label distribution of the Russian model is much closer to the original, and accordingly its accuracy is higher.
Next, the Russian and English models are compared with each other across the diagnostic categories: logic, lexical-semantics, predicate-argument-structure, and knowledge. For each category, only the fine-grained groups with more than 10 examples are kept in the analysis, since statistics computed on a very small number of examples are unreliable and may even be misleading.
def group_dataframe_by_label(df, label):
    # Keep only the most frequent fine-grained tags of the chosen category;
    # very small groups give unreliable statistics.
    df_short = df[df[label].isin(df[label].value_counts().head(13).index)].copy()
    # Encode labels numerically so that groupby().sum() divided by group size gives shares.
    df_short.equal = df_short.equal.map({False: 0, True: 1})
    df_short.label = df_short.label.map({'not_entailment': 0, 'entailment': 1})
    df_short.prediction = df_short.prediction.map({'not_entailment': 0, 'entailment': 1})
    df_short['ss'] = 1  # per-row counter, summed into group sizes
    res_df = df_short[[label, 'equal', 'label', 'prediction', 'ss']].groupby(label).sum()
    for col in ['equal', 'label', 'prediction']:
        res_df[col] = res_df[col] / res_df.ss  # sums -> shares (accuracy, gold rate, predicted rate)
    return res_df.sort_values('equal', ascending=False)
res_df = group_dataframe_by_label(en_df, 'logic')
sns.heatmap(res_df[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
sns.heatmap(group_dataframe_by_label(rus_df, 'logic')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Logic
The Russian BERT model did much better on Negation, Disjunction and Non-monotone, but failed completely on Double negation. The latter is possibly due to the peculiarities of negation in the Russian language.
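To see what the Russian model is actually missing, the failing group can be inspected directly; the tag spelling 'Double negation' is assumed to match the value_counts and heatmap above:
# Double-negation pairs where the Russian model's prediction differs from the gold label
errors = rus_df[(rus_df['logic'] == 'Double negation')
                & (rus_df['prediction'] != rus_df['label'])]
errors[['sentence1', 'sentence2', 'label', 'prediction']].head()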
sns.heatmap(group_dataframe_by_label(en_df, 'lexical-semantics')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
sns.heatmap(group_dataframe_by_label(rus_df, 'lexical-semantics')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Lexical-semantics
The Russian BERT model performs much better than the English one with Quantifiers and Factivity, but could not cope with Morphological negation and Named entities.
sns.heatmap(group_dataframe_by_label(en_df, 'predicate-argument-structure')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
sns.heatmap(group_dataframe_by_label(rus_df, 'predicate-argument-structure')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
Predicate-argument structure
Most notably, the Russian BERT completely fails on Restrictivity, while the English model shows a good result.
sns.heatmap(group_dataframe_by_label(en_df, 'knowledge')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
sns.heatmap(group_dataframe_by_label(rus_df, 'knowledge')[['equal', 'label', 'prediction']].rename(columns = {'equal':'correct_pred'}), annot=True)
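As a possible next step, the per-category accuracies of both models can be placed side by side in a single table; a small sketch reusing group_dataframe_by_label (compare_models is a helper introduced here, not part of the original analysis):
def compare_models(label):
    # 'equal' holds the per-group accuracy after group_dataframe_by_label
    en = group_dataframe_by_label(en_df, label)['equal'].rename('english_acc')
    ru = group_dataframe_by_label(rus_df, label)['equal'].rename('russian_acc')
    return pd.concat([en, ru], axis=1).round(3)

compare_models('logic')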