MOdel ResOurCe COnsumption evaluation project
The MOROCCO project aims to evaluate the performance of Russian SuperGLUE models: inference speed and GPU RAM usage.
We highly welcome leaderboard participants to move from static text submissions with predictions to public repositories for their models and reproducible Docker containers.
At the moment, the project team is measuring the performance of all publicly available models.
To add your model to the performance leaderboard: 1) create a Docker container following the example; 2) when submitting your results, specify the public URL of the model and the container code needed to run it.
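As an illustration, running a submitted container can be sketched in Python. The container interface assumed here (JSONL records on stdin, predictions on stdout) and the image name are hypothetical, not the project's fixed API:

```python
import subprocess

def build_run_command(image: str) -> list[str]:
    """Assemble a `docker run` command for a leaderboard container.

    The image name is supplied by the participant; `--gpus all` assumes
    an NVIDIA container runtime is available on the evaluation host.
    """
    return ["docker", "run", "--rm", "--gpus", "all", "-i", image]

def run_container(image: str, records: list[str]) -> list[str]:
    """Feed JSONL records to the container's stdin and read predictions
    from its stdout (assumed interface, one prediction per line)."""
    proc = subprocess.run(
        build_run_command(image),
        input="\n".join(records),
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()
```

The stdin/stdout convention keeps the container stateless and easy to benchmark, but the actual protocol is whatever the container example in the repository specifies.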
Monolingual Russian BERT (Bidirectional Encoder Representations from Transformers) in the DeepPavlov implementation: cased, 12-layer, 768-hidden, 12-heads, 180M parameters
RuBERT was trained on the Russian part of Wikipedia and on news data. The training data was used to build a vocabulary of Russian subtokens, and the multilingual version of BERT-base was used to initialize the RuBERT model.
BERT in the DeepPavlov implementation: Russian, cased, 12-layer, 768-hidden, 12-heads, 180M parameters
Conversational RuBERT was trained on OpenSubtitles, Dirty, Pikabu, and the Social Media segment of the Taiga corpus. A new vocabulary was assembled for the Conversational RuBERT model on this data; the model was initialized with RuBERT weights.
Russian GPT-3 models trained with a 2048-token context length, including rugpt3-small, rugpt3-medium, and rugpt3-large.
To measure a model's GPU RAM usage, we run its container with a single record as input, measure the maximum GPU RAM consumption, repeat the procedure 5 times, and take the median value.
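A minimal sketch of this procedure, assuming GPU memory is sampled via `nvidia-smi` (the query flags below are real `nvidia-smi` options; the single-sample reading is a simplification of tracking the maximum over a run):

```python
import statistics
import subprocess

def gpu_ram_used_mb() -> int:
    """Current GPU memory usage in MB for the first GPU, via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.splitlines()[0])

def median_peak_ram(peaks_mb: list[int]) -> float:
    """Aggregate repeated measurements: the project takes the median
    of 5 runs to smooth out run-to-run variation."""
    return statistics.median(peaks_mb)
```

For example, `median_peak_ram([3100, 3050, 3120, 3080, 3090])` gives 3090.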
rugpt3-small has approximately the same GPU RAM usage as the BERT-based baselines; rugpt3-medium uses ~2 times more, and rugpt3-large ~3 times more.
GPU RAM usage, GB
To measure inference speed, we run a container with 2000 records as input and batch size 32; on all tasks, batch size 32 utilizes the GPU at almost 100%. To estimate initialization time, we run the container with an input of size 1. Inference speed is then (input size = 2000) / (total time - initialization time). We repeat the procedure 5 times and take the median value.
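The throughput calculation above can be written down directly; the timing values in the usage note are illustrative, not measured results:

```python
import statistics

def inference_speed(total_time_s: float, init_time_s: float,
                    n_records: int = 2000) -> float:
    """Records per second, with container initialization time
    subtracted from the total wall-clock time."""
    return n_records / (total_time_s - init_time_s)

def median_speed(runs: list[tuple[float, float]]) -> float:
    """Median throughput over repeated (total_time, init_time) runs,
    matching the project's repeat-5-take-median protocol."""
    return statistics.median(inference_speed(t, i) for t, i in runs)
```

For instance, a run that takes 12 s in total with 2 s of initialization yields `inference_speed(12.0, 2.0)` = 200 records per second.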
Inference speed, records per second
Each disc corresponds to a baseline model; the disc size is proportional to its GPU RAM usage. The X axis shows inference speed in records per second; the Y axis shows the model score averaged over the 9 Russian SuperGLUE tasks.
rugpt3-small processes ~200 records per second, while rugpt3-large processes ~60 records per second.
bert-multilingual is a bit slower than rubert* due to its worse Russian tokenizer: bert-multilingual splits text into more tokens and therefore has to process larger batches.
rugpt3-large performs worse than the smaller rugpt3-medium: it has more parameters, but has currently been trained for less time and has a lower score.