MOROCCO¶

MOdel ResOurCe COnsumption evaluation project

The MOROCCO project aims to evaluate the performance of Russian SuperGLUE models: inference speed and GPU RAM usage.

We strongly encourage leaderboard participants to move from static text submissions with predictions to public model repositories and reproducible Docker containers.

This page contains a general description of the project. For public Docker images and technical documentation, please see the MOROCCO GitHub page. All credits go to @kuk.

Models¶

At the moment, the project team measures the performance of all publicly available models: rubert, rubert-conversational, bert-multilingual, rugpt3-small, rugpt3-medium, and rugpt3-large.

To add your model to the performance leaderboard: 1) create a Docker container following the example; 2) when submitting your results, specify the public URL of the model and the container code needed to run it.
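
As an illustration only, here is a minimal Python sketch of a container entry point; the JSONL-over-stdin/stdout interface and the predict_batch function are assumptions made for this sketch, so please follow the example container on the MOROCCO GitHub page for the actual contract.

    # entrypoint.py: minimal sketch of a container entry point.
    # Assumption: the container reads task records as JSON lines from stdin
    # and writes one prediction per line to stdout; the real interface is
    # defined by the example container on the MOROCCO GitHub page.
    import sys
    import json

    def predict_batch(records):
        # Placeholder: replace with your model's actual inference code.
        return [{"idx": record.get("idx"), "label": "false"} for record in records]

    def main():
        records = [json.loads(line) for line in sys.stdin if line.strip()]
        for prediction in predict_batch(records):
            sys.stdout.write(json.dumps(prediction, ensure_ascii=False) + "\n")

    if __name__ == "__main__":
        main()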

RuBERT¶

Monolingual Russian BERT (Bidirectional Encoder Representations from Transformers) in the DeepPavlov implementation: cased, 12 layers, 768 hidden units, 12 heads, 180M parameters.

RuBERT was trained on the Russian part of Wikipedia and news data. The training data was used to build a vocabulary of Russian subtokens, and the multilingual version of BERT-base was used to initialize the RuBERT model.

  • Docs: http://docs.deeppavlov.ai/en/master/features/models/bert.html
  • Repository: https://github.com/deepmipt/DeepPavlov/blob/master/docs/features/models/bert.rst
  • HuggingFace: https://huggingface.co/DeepPavlov/rubert-base-cased
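
As a quick sanity check, the HuggingFace checkpoint listed above can be loaded with the transformers library; this snippet is only a usage sketch and is not part of the MOROCCO evaluation code.

    # Sketch: loading RuBERT from the HuggingFace Hub with the transformers library.
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
    model = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased")

    # Encode an example Russian sentence and inspect the output shape.
    inputs = tokenizer("Пример предложения на русском языке.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch, sequence length, 768)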

RuBERT Conversational¶

BERT in the DeepPavlov implementation: Russian, cased, 12 layers, 768 hidden units, 12 heads, 180M parameters.

Conversational RuBERT was trained on OpenSubtitles, Dirty, Pikabu, and the Social Media segment of the Taiga corpus. A new vocabulary was assembled for the Conversational RuBERT model on this data; the model was initialized with RuBERT weights.

  • Docs: http://docs.deeppavlov.ai/en/master/features/models/bert.html
  • Repository: https://github.com/deepmipt/DeepPavlov/blob/master/docs/features/models/bert.rst
  • HuggingFace: https://huggingface.co/DeepPavlov/rubert-base-cased-conversational

Bert multilingual¶

Classic multilingual BERT (mBERT).

  • Docs: https://github.com/google-research/bert
  • Repository: https://github.com/google-research/bert#pre-trained-models
  • HuggingFace: https://huggingface.co/bert-base-multilingual-cased

RuGPT3 Family¶

Russian GPT-3 models trained with a context length of 2048: rugpt3-small, rugpt3-medium, and rugpt3-large.

  • Docs: https://github.com/sberbank-ai/ru-gpts
  • Repository: https://github.com/sberbank-ai/ru-gpts
  • HuggingFace: https://huggingface.co/sberbank-ai

Performance¶

GPU RAM¶

To measure a model's GPU RAM usage, we run a container with a single record as input, measure the maximum GPU RAM consumption, repeat the procedure 5 times, and take the median value (a sketch of this procedure is given after the table below). rubert, rubert-conversational, bert-multilingual, and rugpt3-small have approximately the same GPU RAM usage; rugpt3-medium uses ~2 times more than rugpt3-small, and rugpt3-large ~3 times more.

model                   danetqa  muserc  parus  rcb   rucos  russe  rwsd  terra  lidirus
rubert                  2.40     2.40    2.39   2.39  2.40   2.39   2.39  2.39   2.39
rubert-conversational   2.40     2.40    2.39   2.39  2.40   2.39   2.39  2.39   2.39
bert-multilingual       2.40     2.40    2.39   2.39  2.40   2.39   2.40  2.39   2.39
rugpt3-small            2.38     2.38    2.36   2.37  2.38   2.36   2.36  2.37   2.36
rugpt3-medium           4.41     4.38    4.39   4.39  4.38   4.38   4.41  4.39   4.39
rugpt3-large            7.49     7.49    7.50   7.50  7.49   7.49   7.51  7.50   7.50

GPU RAM usage, GB
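
Below is a minimal Python sketch of this measurement loop. Polling nvidia-smi for peak memory and the exact docker command are assumptions made for the sketch; the actual benchmark code lives in the MOROCCO repository.

    # Sketch: peak GPU RAM while a container processes a single record,
    # repeated 5 times with the median taken.
    import statistics
    import subprocess
    import threading
    import time

    def gpu_memory_used_mb():
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
        )
        return int(out.decode().split()[0])

    def peak_memory_during(command, input_path):
        peak = 0
        done = threading.Event()

        def poll():
            nonlocal peak
            while not done.is_set():
                peak = max(peak, gpu_memory_used_mb())
                time.sleep(0.1)

        thread = threading.Thread(target=poll)
        thread.start()
        with open(input_path) as f:
            subprocess.run(command, stdin=f, stdout=subprocess.DEVNULL, check=True)
        done.set()
        thread.join()
        return peak

    # Hypothetical docker invocation of a task container; the image name is illustrative.
    command = ["docker", "run", "--gpus", "all", "--interactive", "--rm", "rubert-danetqa"]
    runs = [peak_memory_during(command, "single_record.jsonl") for _ in range(5)]
    print("median peak GPU RAM, MB:", statistics.median(runs))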

Inference speed¶

To measure inference speed, we run a container with 2000 records as input and batch size 32. On all tasks, batch size 32 utilizes the GPU at almost 100%. To estimate initialization time, we run a container with an input of size 1. Inference speed is (input size = 2000) / (total time - initialization time). We repeat the procedure 5 times and take the median value (a sketch of this calculation is given after the table below).

model                   danetqa  muserc  parus  rcb  rucos  russe  rwsd  terra  lidirus
rubert                  118      4       1070   295  9      226    102   297    165
rubert-conversational   103      4       718    289  8      225    101   302    171
bert-multilingual       90       4       451    194  7      164    85    195    136
rugpt3-small            97       4       872    289  8      163    105   319    176
rugpt3-medium           45       2       270    102  3      106    70    111    106
rugpt3-large            27       1       137    53   2      75     49    61     69

Inference speed, records per second
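
The speed formula above can be written out as a short Python sketch; the docker command and image name here are assumptions, and the real benchmark code is in the MOROCCO repository.

    # Sketch: inference speed = 2000 / (total time - initialization time),
    # repeated 5 times with the median taken.
    import statistics
    import subprocess
    import time

    def run_container_seconds(command, input_path):
        start = time.monotonic()
        with open(input_path) as f:
            subprocess.run(command, stdin=f, stdout=subprocess.DEVNULL, check=True)
        return time.monotonic() - start

    # Hypothetical docker invocation of a task container; the image name is illustrative.
    command = ["docker", "run", "--gpus", "all", "--interactive", "--rm", "rubert-danetqa"]

    speeds = []
    for _ in range(5):
        init_time = run_container_seconds(command, "single_record.jsonl")  # input of size 1
        total_time = run_container_seconds(command, "records_2000.jsonl")  # 2000 records
        speeds.append(2000 / (total_time - init_time))

    print("median inference speed, records/s:", statistics.median(speeds))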

General evaluation¶

Each disc corresponds to a baseline model; disc size is proportional to GPU RAM usage. The X axis shows model inference speed in records per second; the Y axis shows the model score averaged over the 9 Russian SuperGLUE tasks. A plotting sketch is given after the observations below.

  • Smaller models have higher inference speed. rugpt3-small processes ~200 records per second, while rugpt3-large processes ~60 records per second.
  • bert-multilingual is a bit slower than rubert* due to its worse Russian tokenizer: it splits text into more tokens and therefore has to process larger batches.
  • Larger models usually show higher scores, but in our case rugpt3-medium and rugpt3-large perform worse than the smaller rubert* models.
  • rugpt3-large has more parameters than rugpt3-medium but has currently been trained for less time and has a lower score.
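
For reference, a plot like the one described above can be produced with a short matplotlib sketch; the per-model speed, score, and GPU RAM values are passed in by the caller (taken from the tables above and the leaderboard), and none are hard-coded here.

    # Sketch: bubble chart with inference speed on the X axis, average score on the
    # Y axis, and disc size proportional to GPU RAM usage. The input dictionaries
    # map model name -> value and must be filled by the caller.
    import matplotlib.pyplot as plt

    def plot_general_evaluation(speed, score, gpu_ram_gb):
        fig, ax = plt.subplots()
        for model in speed:
            ax.scatter(speed[model], score[model], s=gpu_ram_gb[model] * 100, alpha=0.5)
            ax.annotate(model, (speed[model], score[model]))
        ax.set_xlabel("inference speed, records per second")
        ax.set_ylabel("average score on 9 Russian SuperGLUE tasks")
        plt.show()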