allegro/herbert-base-cased

HerBERT

HerBERT is a BERT-based Language Model trained on Polish corpora using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, please refer to: HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish.

Model training and experiments were conducted with the transformers library, version 2.9.

Corpus

HerBERT was trained on six different corpora available for the Polish language:

| Corpus                     | Tokens | Documents |
|----------------------------|--------|-----------|
| CCNet Middle               | 3243M  | 7.9M      |
| CCNet Head                 | 2641M  | 7.0M      |
| National Corpus of Polish  | 1357M  | 3.9M      |
| Open Subtitles             | 1056M  | 1.1M      |
| Wikipedia                  | 260M   | 1.4M      |
| Wolne Lektury              | 41M    | 5.5k      |

Tokenizer

The training dataset was tokenized into subwords using a character-level byte-pair encoding tokenizer (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.
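
For illustration, here is a minimal sketch of training a CharBPETokenizer with the tokenizers library. The corpus file path and all settings other than the 50k vocabulary size are placeholders, not the exact HerBERT recipe.

import os
from tokenizers import CharBPETokenizer

# Sketch only: train a character-level BPE tokenizer on a plain-text corpus.
# "corpus.txt" is a placeholder path, not an actual HerBERT training file.
tokenizer = CharBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=50000)

# Save vocab.json and merges.txt for later reuse.
os.makedirs("herbert-tokenizer", exist_ok=True)
tokenizer.save_model("herbert-tokenizer")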

We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.
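
For example, the fast tokenizer can be loaded directly rather than through AutoTokenizer; the sample sentence below is only an illustration:

from transformers import HerbertTokenizerFast

tokenizer = HerbertTokenizerFast.from_pretrained("allegro/herbert-base-cased")
# Subword tokenization with the 50k character-level BPE vocabulary.
print(tokenizer.tokenize("HerBERT jest modelem językowym dla języka polskiego."))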

Usage

Example code:

from transformers import AutoTokenizer, AutoModel

# Load the HerBERT tokenizer and the BERT encoder.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# Encode a batch containing a single sentence pair and run it through the model.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
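
The call above returns the contextual token embeddings. A minimal sketch of reading them out follows; the mean-pooled sentence vector is just one common choice, not something prescribed by this model card:

# Per-token embeddings for the batch; shape: (batch_size, sequence_length, 768).
last_hidden_state = output[0]

# One simple (illustrative) way to get a fixed-size sentence representation:
# average the token embeddings along the sequence dimension.
sentence_embedding = last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])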

License

CC BY 4.0

Citation

If you use this model, please cite the following paper:

@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert  and
      Rybak, Piotr  and
      Wr{\'o}blewska, Alina  and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kiyv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}

Authors

The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.

You can contact us at: klejbenchmark@allegro.pl
