Modern universal language models, many of them transformer-based, such as BERT, ELMo, XLNet, and RoBERTa, need to be properly compared and evaluated. In the last year, new models and methods for pretraining and transfer learning have driven striking performance improvements across a range of language understanding tasks.
We offer a testing methodology based on tasks typically proposed for “strong AI”: logic, commonsense, and reasoning. Following the GLUE and SuperGLUE methodology, we present a set of test tasks for general language understanding, together with a leaderboard of models.
For the first time, a complete test for the Russian language has been developed that is analogous to its English counterpart. Many of the datasets were composed for the first time, and a leaderboard of Russian-language models with comparable results is also presented.