title: "BERT 复习" categories:
复习
Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly "see itself", and the model could trivially predict the target word in a multi-layered context (Devlin, et al., 2018).
This is something I only noticed when I was reading RoBERTa.
The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.
We compare this strategy with dynamic masking where we generate the masking pattern every time we feed a sequence to the model. This becomes crucial when pretraining for more steps or with larger datasets.
有不少 "解读" 文章说 dynamic masking 指将数据复制 10 份做不同的 masking, 但根据原文和 BERT 源码, 这其实是 BERT 做的事情. RoBERTa 的 dynamic 指在喂给模型之前才 masking (即时生成).
See create_pretraining_data.py here. The snippet below has been there since the initial BERT release 3 years ago.
flags.DEFINE_integer(
    "dupe_factor", 10,
    "Number of times to duplicate the input data (with different masks).")

for _ in range(dupe_factor):
  for document_index in range(len(all_documents)):
    instances.extend(
        create_instances_from_document(
            all_documents, document_index, max_seq_length, short_seq_prob,
            masked_lm_prob, max_predictions_per_seq, vocab_words, rng))
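For contrast, here is a minimal sketch of what dynamic masking looks like: a fresh 80%/10%/10% mask is drawn every time a sequence is fed to the model, so the training data never needs to be duplicated. The helper below is my own illustration (it skips special-token handling for brevity), not RoBERTa's actual code; Hugging Face's DataCollatorForLanguageModeling does essentially the same thing at batch-collation time.

import torch

def dynamic_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Draw a fresh random mask for a batch (illustrative; special tokens are not excluded)."""
    labels = input_ids.clone()
    # pick ~15% of positions as MLM prediction targets
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # loss is computed only on the masked positions
    # of those targets: 80% -> [MASK], 10% -> random token, 10% -> left unchanged
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    return input_ids, labels

# Typical use: call this from a DataLoader collate_fn so that every epoch (in fact, every
# batch) sees a different mask; for bert-base-uncased, mask_token_id=103 and vocab_size=30522.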
There are two existing strategies for applying pre-trained language representations to downstream tasks (Devlin, et al., 2018): the feature-based approach (e.g. ELMo), which feeds the pre-trained representations into a task-specific architecture as additional features, and the fine-tuning approach (e.g. OpenAI GPT), which adds minimal task-specific parameters and fine-tunes all pre-trained parameters on the downstream task.
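A minimal sketch of the feature-based route, assuming the Hugging Face transformers API (the model name and variable names here are illustrative): BERT is run under torch.no_grad() and its hidden states are used as fixed input features for a separate task-specific model.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()  # used as a frozen feature extractor; its parameters are never updated

inputs = tokenizer("BERT as a frozen feature extractor", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# fixed features for a downstream task-specific model, e.g. the final-layer [CLS] vector
cls_features = outputs.last_hidden_state[:, 0]  # shape: (batch_size, hidden_size)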
Note that if you are used to freezing the body of your pretrained model (like in computer vision) the above may seem a bit strange, as we are directly fine-tuning the whole model without taking any precaution. It actually works better this way for Transformers model (so this is not an oversight on our side). If you’re not familiar with what “freezing the body” of the model means, forget you read this paragraph. From Hugging Face Fine-tuning a pretrained model
import torch

# quick example: freezing the embeddings and the first 4 encoder layers of `bert`
for module in [bert.embeddings, bert.encoder.layer[:4]]:
    for param in module.parameters():
        param.requires_grad = False

# pass only the still-trainable parameters to the optimizer
# (rel_extractor is assumed to be the downstream model that wraps `bert`)
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, rel_extractor.parameters()))
Todo