BERT Fine-Tuning Code Walkthrough (Part 1)

A hands-on look at the code behind BERT fine-tuning, using run_classifier.py as the example.

The official BERT GitHub repository is at https://github.com/google-research/bert and describes the model in detail. For more depth, see the original paper: https://arxiv.org/abs/1810.04805 .

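BERT is essentially a two-stage NLP model. The first stage, pre-training, is similar in spirit to training word embeddings: a language model is trained on existing unlabeled text. The second stage, fine-tuning, uses the pre-trained language model to solve a specific downstream NLP task.

Google has already spent the large corpora and expensive hardware needed for pre-training, so this post covers the much cheaper fine-tuning stage.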

Back in the GitHub code, only run_classifier.py and run_squad.py are used for fine-tuning; the other files can be ignored for now. Here we use run_classifier.py for a sentence classification task.

Code walkthrough

Starting from the main entry point, we can see which flags are marked as required:

if __name__ == "__main__":
  flags.mark_flag_as_required("data_dir")
  flags.mark_flag_as_required("task_name")
  flags.mark_flag_as_required("vocab_file")
  flags.mark_flag_as_required("bert_config_file")
  flags.mark_flag_as_required("output_dir")
  tf.app.run()

These flags are a good starting point for exploring run_classifier.py:

data_dir

This is the path to the directory holding our input data. Looking at the code, the authors define the format of a single input example:

class InputExample(object):
  """A single training/test example for simple sequence classification."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    """Constructs a InputExample.
    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only must be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train and dev examples, but not for test examples.
    """
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label

The inputs are guid, text_a, text_b and label, where text_b and label are optional. For a single-sentence classification task there is no need to provide text_b, and for test examples there is no need to provide label.
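To make the three cases concrete, here is a minimal sketch. The class is a stripped-down copy of the InputExample shown above so that the snippet runs on its own, and the sentences, guids and labels are made up for illustration.

```python
class InputExample(object):
    """Minimal copy of the container shown above, so this sketch is standalone."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


# Single-sentence classification: text_b is simply left out.
single = InputExample(guid="train-0", text_a="the movie was great", label="1")

# Sentence-pair task (for example paraphrase detection): both texts are given.
pair = InputExample(
    guid="train-1",
    text_a="He bought a car.",
    text_b="He purchased a vehicle.",
    label="1",
)

# Test example: no label is known, so it keeps the default None.
unlabeled = InputExample(guid="test-0", text_a="an unseen sentence")

print(single.text_b, pair.text_b, unlabeled.label)
```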

task_name

It may not be obvious at first what task_name is for. A closer look at the code shows:

processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mrpc": MrpcProcessor,
    "xnli": XnliProcessor,
}

task_name = FLAGS.task_name.lower()

if task_name not in processors:
    raise ValueError("Task not found: %s" % (task_name))

processor = processors[task_name]()

So task_name selects the processor.

Next, look at a processor, taking "mrpc" as the example:

class MrpcProcessor(DataProcessor):
  """Processor for the MRPC data set (GLUE version)."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    """See base class."""
    return ["0", "1"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      if i == 0:
        continue
      guid = "%s-%s" % (set_type, i)
      text_a = tokenization.convert_to_unicode(line[3])
      text_b = tokenization.convert_to_unicode(line[4])
      if set_type == "test":
        label = "0"
      else:
        label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples

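The processor preprocesses the input data found in data_dir.
We can also see that the data in data_dir must be in .tsv format, with the training, dev and test sets stored as train.tsv, dev.tsv and test.tsv; for now we only use train.tsv and dev.tsv. The labels are defined in get_labels(): for binary classification it returns ["0", "1"]. Finally, _create_examples() determines how guid is built and how text_a, text_b and label are assigned.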
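The processor relies on a _read_tsv helper from the base class that is not shown above. The following is a rough standalone sketch of what such a reader does, not the repository's exact code: it uses the built-in open instead of TensorFlow's file API, and the file contents are a hypothetical two-line sample in the MRPC column layout (label in column 0, sentences in columns 3 and 4).

```python
import csv
import os
import tempfile


def read_tsv(input_file):
    """Reads a tab-separated file into a list of rows (each row a list of str)."""
    with open(input_file, "r", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: quote characters inside sentences are kept as plain text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader]


# Hypothetical sample file: one header row plus one data row.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "train.tsv")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n")
    f.write("1\t101\t102\tHe bought a car.\tHe purchased a vehicle.\n")

rows = read_tsv(path)
header, first = rows[0], rows[1]

# _create_examples() above skips rows[0] and reads these same columns.
print(first[0], "|", first[3], "|", first[4])
```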

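At this point the picture is fairly clear. For fine-tuning, all we need to do is:

Get a GPU with roughly 12 GB of memory. If you do not have one, Google's free GPUs will do
Prepare train.tsv, dev.tsv and test.tsv
Create a processor matching your own task_name, which extracts the data in train.tsv, dev.tsv and test.tsv and assigns it to text_a, text_b and label
Download the pre-trained model, set the relevant flags, and run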

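The remaining required flags, "vocab_file", "bert_config_file" and "output_dir", are easy to understand: the first two point to files shipped with the pre-trained BERT model (the vocabulary and the model configuration), and the last is the directory where fine-tuning output is written.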

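Fine-tuning in practice

Prepare train.tsv, dev.tsv and test.tsv

The .tsv extension may look unfamiliar, but the format is essentially CSV with a tab instead of a comma as the field separator. The task here is a 4-class classification, and each file is laid out as follows: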

train.tsv: (label + '\t' + sentence)

dev.tsv: (label + '\t' + sentence)

test.tsv: (sentence)
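As a rough illustration of these layouts, the sketch below writes three tiny files in that shape. The sentences and labels are invented, the temporary directory is only a stand-in for the real data_dir, and write_tsv is a hypothetical helper rather than anything from the BERT repository.

```python
import os
import tempfile

# Hypothetical data for a 4-class task with labels "0" to "3".
train_rows = [("0", "first training sentence"), ("3", "second training sentence")]
dev_rows = [("1", "a dev sentence")]
test_sentences = ["an unlabeled test sentence"]


def write_tsv(path, rows):
    """Writes each row (a tuple of str) as one tab-separated line, no header."""
    with open(path, "w", encoding="utf-8") as f:
        for fields in rows:
            for field in fields:
                # A tab or newline inside a field would corrupt the layout.
                if "\t" in field or "\n" in field:
                    raise ValueError("field contains a tab or newline: %r" % field)
            f.write("\t".join(fields) + "\n")


data_dir = tempfile.mkdtemp()  # stand-in for the real data_dir
write_tsv(os.path.join(data_dir, "train.tsv"), train_rows)
write_tsv(os.path.join(data_dir, "dev.tsv"), dev_rows)
# test.tsv has a single column: the sentence only.
write_tsv(os.path.join(data_dir, "test.tsv"), [(s,) for s in test_sentences])

with open(os.path.join(data_dir, "train.tsv"), encoding="utf-8") as f:
    train_text = f.read()
print(train_text)
```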

Create a new processor

I named my sentence classification task "bert_move" and registered it in the processors dictionary:

processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "bert_move": MoveProcessor
  }

Then create MoveProcessor, modeled on MrpcProcessor:

class MoveProcessor(DataProcessor):
  """Processor for the move data set ."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    """See base class."""
    return ["0", "1", "2", "3"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%s" % (set_type, i)
      if set_type == "test":
        text_a = tokenization.convert_to_unicode(line[0])
        label = "0"
      else:
        text_a = tokenization.convert_to_unicode(line[1])
        label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

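The main changes are:

get_labels() returns the four class labels ["0", "1", "2", "3"]; these must match the label strings used in the .tsv files
_create_examples() extracts the text and label into text_a and label. When set_type is "test" (the rows come from test.tsv), only text_a is read from the file and label is set to the placeholder "0". Unlike MrpcProcessor, it does not skip the first row, so the .tsv files must not contain a header line
guid is generated automatically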
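Why the label strings matter: as far as I can tell from the script, run_classifier.py turns the list from get_labels() into a string-to-index mapping and looks up each example's label in it, so a label that is missing from the list fails at that lookup. The sketch below imitates that mapping in isolation; the rows are made up.

```python
# What MoveProcessor.get_labels() returns in this post.
label_list = ["0", "1", "2", "3"]

# Mapping from label string to integer class id, in list order.
label_map = {label: i for i, label in enumerate(label_list)}

# Hypothetical rows as they would come out of train.tsv: [label, sentence].
rows = [
    ["2", "a sentence with a known label"],
    ["4", "a sentence whose label is missing from get_labels()"],
]

for label, sentence in rows:
    if label in label_map:
        print(sentence, "->", label_map[label])
    else:
        # A plain label_map[label] lookup would raise KeyError here.
        print(sentence, "-> unknown label", repr(label))
```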

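Set the flags and run fine-tuning

The flags can be changed directly in the flag definitions at the top of run_classifier.py, after which the script can be run as-is. I also looked into the way the GitHub README passes them on the command line: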

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/

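Some notes on these flags:

At least one of do_train, do_eval and do_test must be True (the current official repository names the last flag do_predict). When fine-tuning, you would typically set do_train and do_eval to True and leave do_test at its default of False. Once the model is trained, you can set only do_test to True, and the trained model saved in output_dir is loaded automatically for testing.
max_seq_length and train_batch_size can be adjusted to suit your hardware. The values shown above were tested on a GTX 1080 Ti and on the free Tesla K80 GPU provided by Google Colab, and ran without problems.
For the pre-trained model, the official release offers two sizes, Large and Base; see the GitHub README and the paper for details. In repeated tests, the two devices above could only handle the Base model. The Large model clearly needs a machine with more memory (a TPU).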
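One side effect of changing train_batch_size is worth keeping in mind: as I read the script, the number of optimizer steps is derived from the dataset size, the batch size and the number of epochs, and the warmup length is a fraction of that. The sketch below reproduces that arithmetic on its own. The example count of 10000 is hypothetical, and the 0.1 warmup fraction is what I understand to be the script's default for warmup_proportion.

```python
def training_schedule(num_examples, train_batch_size, num_train_epochs,
                      warmup_proportion=0.1):
    """Returns (num_train_steps, num_warmup_steps) for the given settings."""
    num_train_steps = int(num_examples / train_batch_size * num_train_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


# Halving the batch size to fit a smaller GPU doubles the number of steps.
for batch_size in (32, 16, 8):
    steps, warmup = training_schedule(10000, batch_size, 3.0)
    print("batch_size=%d steps=%d warmup=%d" % (batch_size, steps, warmup))
```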

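After a lot of experimenting, the command-line invocation given by Google above seems to work cleanly only in Google Colab; other terminals ran into problems of one kind or another.

That's all.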
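One possible way around shell differences, offered as a sketch rather than a tested recipe, is to assemble the same flags in Python and launch the script without going through a shell. The flag names are the ones from the command above; the three directory paths are hypothetical placeholders, and the task name assumes the "bert_move" processor has been registered as described earlier.

```python
import os
import sys

# Hypothetical locations; replace them with your own directories.
BERT_BASE_DIR = os.path.join("models", "uncased_L-12_H-768_A-12")
DATA_DIR = os.path.join("data", "bert_move")
OUTPUT_DIR = os.path.join("output", "bert_move")

flag_values = {
    "task_name": "bert_move",
    "do_train": "true",
    "do_eval": "true",
    "data_dir": DATA_DIR,
    "vocab_file": os.path.join(BERT_BASE_DIR, "vocab.txt"),
    "bert_config_file": os.path.join(BERT_BASE_DIR, "bert_config.json"),
    "init_checkpoint": os.path.join(BERT_BASE_DIR, "bert_model.ckpt"),
    "max_seq_length": 128,
    "train_batch_size": 32,
    "learning_rate": 2e-5,
    "num_train_epochs": 3.0,
    "output_dir": OUTPUT_DIR,
}


def build_command(flags):
    """Builds an argument list, so no shell variables or line continuations are needed."""
    cmd = [sys.executable, "run_classifier.py"]
    for name, value in flags.items():
        cmd.append("--%s=%s" % (name, value))
    return cmd


command = build_command(flag_values)
print(" ".join(command))

# To actually start fine-tuning from the repository directory:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list means the operating system receives each flag verbatim, which avoids relying on `$VAR` expansion or backslash line continuations.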