MOROCCO: MOdel ResOurCe COnsumption evaluation project
The MOROCCO project aims to evaluate the performance of Russian SuperGLUE models: inference speed and GPU RAM usage.
We strongly encourage leaderboard participants to move from static text submissions with predictions to public repositories for their models and reproducible Docker containers.
This page contains a general description of the project. For public Docker images and technical documentation, please see the MOROCCO GitHub page. All credits go to @kuk.
At the moment, the project team is measuring the performance of all publicly available models: rubert, rubert-conversational, bert-multilingual, rugpt3-small, rugpt3-medium, and rugpt3-large.
To add your model to the performance leaderboard: 1) create a Docker container according to the example; 2) specify the public URL of the model and the container code to run it when submitting the results.
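As an illustration, here is a minimal sketch of what a container's prediction entrypoint might look like, assuming the container reads task records as JSON lines from stdin and writes one prediction per line to stdout. The exact input/output contract, image layout, and run command are defined in the MOROCCO documentation; the `predict` function below is a hypothetical placeholder.

```python
import sys
import json

# Hypothetical sketch of a container entrypoint: read Russian SuperGLUE
# records as JSON lines from stdin, write one prediction per line to stdout.
# The real contract is defined in the MOROCCO documentation.

def predict(record):
    # Placeholder: a real submission would call the model here.
    return {"idx": record.get("idx"), "label": "false"}

def main():
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        prediction = predict(record)
        sys.stdout.write(json.dumps(prediction, ensure_ascii=False) + "\n")

if __name__ == "__main__":
    main()
```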
Monolingual Russian BERT (Bidirectional Encoder Representations from Transformers) in the DeepPavlov implementation: cased, 12-layer, 768-hidden, 12-heads, 180M parameters.
RuBERT was trained on the Russian part of Wikipedia and news data. The training data was used to build a vocabulary of Russian subtokens, and the multilingual version of BERT-base was then used to initialize the RuBERT model.
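For reference, the listed architecture can be checked against a public checkpoint. The sketch below assumes that the `DeepPavlov/rubert-base-cased` model on the Hugging Face hub corresponds to the RuBERT model described here; it loads the checkpoint with the `transformers` library and prints the configuration and parameter count.

```python
from transformers import AutoModel, AutoTokenizer

# Assumption: the DeepPavlov/rubert-base-cased checkpoint on the Hugging Face
# hub corresponds to the RuBERT model described above.
tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModel.from_pretrained("DeepPavlov/rubert-base-cased")

# Expected to match the listed architecture: 12 layers, 768 hidden units,
# 12 attention heads, ~180M parameters.
print(model.config.num_hidden_layers,
      model.config.hidden_size,
      model.config.num_attention_heads)
print(sum(p.numel() for p in model.parameters()))
```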
BERT in the DeepPavlov implementation: Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters.
Conversational RuBERT was trained on OpenSubtitles, Dirty, Pikabu, and the Social Media segment of the Taiga corpus. A new vocabulary was assembled for the Conversational RuBERT model on this data; the model was initialized with RuBERT weights.
Classic multilingual BERT (mBERT).
Russian GPT-3 models trained with a context length of 2048: rugpt3-small, rugpt3-medium, and rugpt3-large.
To measure model GPU RAM usage, we run a container with a single record as input, measure the maximum GPU RAM consumption, repeat the procedure 5 times, and take the median value. rubert, rubert-conversational, bert-multilingual, and rugpt3-small have approximately the same GPU RAM usage; rugpt3-medium uses ~2 times more than rugpt3-small, and rugpt3-large ~3 times more.
model | danetqa | muserc | parus | rcb | rucos | russe | rwsd | terra | lidirus |
---|---|---|---|---|---|---|---|---|---|
rubert | 2.40 | 2.40 | 2.39 | 2.39 | 2.40 | 2.39 | 2.39 | 2.39 | 2.39 |
rubert-conversational | 2.40 | 2.40 | 2.39 | 2.39 | 2.40 | 2.39 | 2.39 | 2.39 | 2.39 |
bert-multilingual | 2.40 | 2.40 | 2.39 | 2.39 | 2.40 | 2.39 | 2.40 | 2.39 | 2.39 |
rugpt3-small | 2.38 | 2.38 | 2.36 | 2.37 | 2.38 | 2.36 | 2.36 | 2.37 | 2.36 |
rugpt3-medium | 4.41 | 4.38 | 4.39 | 4.39 | 4.38 | 4.38 | 4.41 | 4.39 | 4.39 |
rugpt3-large | 7.49 | 7.49 | 7.50 | 7.50 | 7.49 | 7.49 | 7.51 | 7.50 | 7.50 |
GPU RAM usage, GB
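The measurement procedure above can be sketched roughly as follows. This is not MOROCCO code: `run_container_on_single_record` is a hypothetical placeholder for launching the model container on a one-record input, and peak GPU memory is approximated by polling `nvidia-smi` while the container runs.

```python
import statistics
import subprocess
import time

def gpu_memory_used_mb():
    # Query current GPU memory usage (MB) via nvidia-smi.
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().split()[0])

def run_container_on_single_record():
    # Hypothetical placeholder: launch the model container on a single record.
    return subprocess.Popen(["docker", "run", "--gpus", "all", "model-image"])

def peak_gpu_memory(poll_interval=0.1):
    process = run_container_on_single_record()
    peak = 0
    while process.poll() is None:       # poll until the container exits
        peak = max(peak, gpu_memory_used_mb())
        time.sleep(poll_interval)
    return peak

# Repeat 5 times, take the median, report in GB.
peaks = [peak_gpu_memory() for _ in range(5)]
print("GPU RAM usage, GB:", statistics.median(peaks) / 1024)
```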
To measure inference speed, we run a container with 2000 records as input and batch size 32. On all tasks, batch size 32 utilizes the GPU at almost 100%. To estimate initialization time, we run a container with an input of size 1. Inference speed is (input size = 2000) / (total time - initialization time). We repeat the procedure 5 times and take the median value.
model | danetqa | muserc | parus | rcb | rucos | russe | rwsd | terra | lidirus |
---|---|---|---|---|---|---|---|---|---|
rubert | 118 | 4 | 1070 | 295 | 9 | 226 | 102 | 297 | 165 |
rubert-conversational | 103 | 4 | 718 | 289 | 8 | 225 | 101 | 302 | 171 |
bert-multilingual | 90 | 4 | 451 | 194 | 7 | 164 | 85 | 195 | 136 |
rugpt3-small | 97 | 4 | 872 | 289 | 8 | 163 | 105 | 319 | 176 |
rugpt3-medium | 45 | 2 | 270 | 102 | 3 | 106 | 70 | 111 | 106 |
rugpt3-large | 27 | 1 | 137 | 53 | 2 | 75 | 49 | 61 | 69 |
Inference speed, records per second
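The speed estimate can be sketched like this. Again, this is not MOROCCO code: `run_container` is a hypothetical placeholder that runs the model container on the given number of records (batch size 32) and blocks until it finishes.

```python
import statistics
import time

def run_container(n_records):
    # Hypothetical placeholder: run the model container on n_records records
    # with batch size 32 and wait for it to finish.
    ...

def elapsed(n_records):
    start = time.monotonic()
    run_container(n_records)
    return time.monotonic() - start

def inference_speed(n_records=2000):
    init_time = elapsed(1)           # run with a single record: initialization only
    total_time = elapsed(n_records)  # full run with 2000 records
    return n_records / (total_time - init_time)

# Repeat 5 times and take the median, in records per second.
speeds = [inference_speed() for _ in range(5)]
print("Inference speed, records per second:", statistics.median(speeds))
```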
Each disc corresponds to a baseline model; disc size is proportional to GPU RAM usage. The X axis shows model inference speed in records per second, and the Y axis shows the model score averaged over the 9 Russian SuperGLUE tasks.
rugpt3-small processes ~200 records per second, while rugpt3-large processes ~60 records per second. bert-multilingual is a bit slower than rubert* due to a worse Russian tokenizer: bert-multilingual splits text into more tokens and therefore has to process larger batches. rugpt3-medium and rugpt3-large perform worse than the smaller rubert* models. rugpt3-large has more parameters than rugpt3-medium but has currently been trained for less time and has a lower score.