2020-09-17

Miscellaneous notes

I moved the code from yesterday's post and the one from the day before into Git.
GitHub - CookieBox26/ML: machine learning

tests/test_bert_tokenization.py: turned the checks of the tokenizer's behavior into tests.
script.py: only prints the model configuration and structure.
- This showed that the vocabulary size is 28996.

◆ Model configuration
BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "B-corporation",
    "1": "B-creative-work",
    "2": "B-group",
    "3": "B-location",
    "4": "B-person",
    "5": "B-product",
    "6": "I-corporation",
    "7": "I-creative-work",
    "8": "I-group",
    "9": "I-location",
    "10": "I-person",
    "11": "I-product",
    "12": "O"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 28996
}

◆ Model embedding layer
BertEmbeddings(
  (word_embeddings): Embedding(28996, 1024, padding_idx=0)
  (position_embeddings): Embedding(512, 1024)
  (token_type_embeddings): Embedding(2, 1024)
  (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)
◆ Model encoder layers (24 identical layers are stacked, so only the first one is shown)
BertLayer(
  (attention): BertAttention(
    (self): BertSelfAttention(
      (query): Linear(in_features=1024, out_features=1024, bias=True)
      (key): Linear(in_features=1024, out_features=1024, bias=True)
      (value): Linear(in_features=1024, out_features=1024, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (output): BertSelfOutput(
      (dense): Linear(in_features=1024, out_features=1024, bias=True)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (intermediate): BertIntermediate(
    (dense): Linear(in_features=1024, out_features=4096, bias=True)
  )
  (output): BertOutput(
    (dense): Linear(in_features=4096, out_features=1024, bias=True)
    (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
◆ Model pooler layer
BertPooler(
  (dense): Linear(in_features=1024, out_features=1024, bias=True)
  (activation): Tanh()
)
◆ Dropout and fully connected layer (New!)
Dropout(p=0.1, inplace=False)
Linear(in_features=1024, out_features=13, bias=True)
◆ Feed an arbitrary sentence to the model. → The output has the shape of predictions for 14 tokens × 13 classes.
torch.Size([1, 14, 13])
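
As a sanity check on the printout above, here is a minimal sketch that counts parameters using only the layer shapes shown (embeddings, one encoder layer times 24, pooler, and the new 13-class linear layer). The helper names linear and layer_norm are my own and are not part of transformers; the pooler is counted only because it appears in the printout.

```python
# Parameter counts derived only from the layer shapes printed above.

def linear(n_in, n_out):
    """Weights plus biases of Linear(in_features=n_in, out_features=n_out, bias=True)."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of LayerNorm((n,), elementwise_affine=True)."""
    return 2 * n


hidden, intermediate = 1024, 4096
vocab, positions, token_types = 28996, 512, 2
num_layers, num_labels = 24, 13

# Word, position and token type embeddings followed by a LayerNorm.
embeddings = (vocab + positions + token_types) * hidden + layer_norm(hidden)

encoder_layer = (
    3 * linear(hidden, hidden)                            # query, key, value
    + linear(hidden, hidden) + layer_norm(hidden)         # BertSelfOutput
    + linear(hidden, intermediate)                        # BertIntermediate
    + linear(intermediate, hidden) + layer_norm(hidden)   # BertOutput
)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)  # the newly added 13-class layer

total = embeddings + num_layers * encoder_layer + pooler + classifier

print(f"embeddings    : {embeddings:,}")
print(f"encoder layer : {encoder_layer:,} (x {num_layers})")
print(f"pooler        : {pooler:,}")
print(f"classifier    : {classifier:,}")
print(f"total         : {total:,}")
```

The new classification layer is tiny compared with the rest: almost all of the weights come from the pretrained model.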

2020-09-16

Miscellaneous notes

This only goes as far as creating a model instance for named entity recognition from a pretrained BERT model with transformers.
→ Moved to GitHub. GitHub - CookieBox26/ML: machine learning

Code
Output
Python environment

Code

import torch
from transformers import (
    BertTokenizer,
    BertForTokenClassification,
)

def main():
    # The goal is named entity recognition that classifies each token into one of the following 13 classes.
    labels = [
        'B-corporation',
        'B-creative-work',
        'B-group',
        'B-location',
        'B-person',
        'B-product',
        'I-corporation',
        'I-creative-work',
        'I-group',
        'I-location',
        'I-person',
        'I-product',
        'O'
    ]
    id2label = {i: label for i, label in enumerate(labels)}
    label2id = {label: i for i, label in enumerate(labels)}

    # Specify the name of the pretrained BERT model to use.
    model_name = 'bert-large-cased'

    # Create the tokenizer that matches the pretrained model.
    tokenizer = BertTokenizer.from_pretrained(
        pretrained_model_name_or_path=model_name,
    )

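    # Create a token classification model instance from the pretrained model.
    # Depending on what needs to be set, a config object does not have to be created and passed.
    model = BertForTokenClassification.from_pretrained(
        pretrained_model_name_or_path=model_name,
        id2label=id2label,  # Pass this so that the output for each token is 13-dimensional.
    )
    # A warning says that some weights are not initialized. That is expected
    # for the classification layer, so it is ignored here.
    # print(model)  # Long when printed because there are 24 layers.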


    print('◆ Convert an arbitrary sentence to a sequence of IDs.')
    sentence = 'The Empire State Building officially opened on May 1, 1931.'

    # BERT expects the special token [CLS] at the start of the input
    # and the special token [SEP] at the end.
    # tokenizer.encode() adds them automatically when converting to IDs.
    print('◇')
    ids = tokenizer.encode(sentence)
    for id_ in ids:
        token = tokenizer.convert_ids_to_tokens(id_)
        print(str(id_).ljust(5), token)

    # If a token sequence is already at hand, add the special tokens explicitly.
    print('◇')
    tokens = tokenizer.tokenize(sentence)
    tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token]
    for token in tokens:
        id_ = tokenizer.convert_tokens_to_ids(token)
        print(str(id_).ljust(5), tokenizer.convert_ids_to_tokens(id_))

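    print('◆ Feed it to the model. → The output has the shape of predictions for 14 tokens × 13 classes.')
    inputs = torch.tensor([tokenizer.encode(sentence)])  # Convert the ID sequence to a tensor and pass it.
    outputs = model(inputs)
    print(outputs[0].size())


if __name__ == '__main__':
    main()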

Output

# A warning that some weights are not initialized appears here; it is ignored.
◆ Convert an arbitrary sentence to a sequence of IDs.
◇
101   [CLS]
1109  The
2813  Empire
1426  State
4334  Building
3184  officially
1533  opened
1113  on
1318  May
122   1
117   ,
3916  1931
119   .
102   [SEP]
◇
101   [CLS]
1109  The
2813  Empire
1426  State
4334  Building
3184  officially
1533  opened
1113  on
1318  May
122   1
117   ,
3916  1931
119   .
102   [SEP]
◆ Feed it to the model. → The output has the shape of predictions for 14 tokens × 13 classes.
torch.Size([1, 14, 13])
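
The [1, 14, 13] output holds one score per class for each of the 14 tokens. A minimal sketch of turning such scores into label names follows. To keep it runnable without downloading the model, the scores are random stand-ins (an assumption, not real model output) held in plain Python lists, and decode is a helper name of my own. With the real tensor, outputs[0].argmax(dim=-1) would give the same kind of index per token. Since the classification layer is untrained at this point, the predicted labels would be meaningless either way.

```python
import random

labels = [
    'B-corporation', 'B-creative-work', 'B-group', 'B-location',
    'B-person', 'B-product', 'I-corporation', 'I-creative-work',
    'I-group', 'I-location', 'I-person', 'I-product', 'O',
]
id2label = dict(enumerate(labels))

# Hypothetical stand-in for outputs[0][0]: 14 tokens x 13 class scores.
random.seed(0)
scores = [[random.gauss(0.0, 1.0) for _ in labels] for _ in range(14)]


def decode(token_scores, id2label):
    """Return, for each token, the label whose score is highest."""
    return [
        id2label[max(range(len(row)), key=row.__getitem__)]
        for row in token_scores
    ]


predicted = decode(scores, id2label)
print(len(predicted))  # one label per token, including [CLS] and [SEP]
print(predicted)
```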

Python environment

[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[packages]
torch = "==1.4.0"
transformers = "==3.1.0"

[requires]
python_version = "3.7.0"

2020-09-15

Miscellaneous notes

This only goes as far as creating a model instance for named entity recognition from a pretrained BERT model with transformers.
→ Improved version (2020-09-16) → Moved to GitHub. GitHub - CookieBox26/ML: machine learning

from transformers import (
    BertConfig,
    BertForTokenClassification
)

def main():
    # The goal is named entity recognition that classifies each token into one of the following 13 classes.
    labels = [
        'B-corporation',
        'B-creative-work',
        'B-group',
        'B-location',
        'B-person',
        'B-product',
        'I-corporation',
        'I-creative-work',
        'I-group',
        'I-location',
        'I-person',
        'I-product',
        'O'
    ]
    id2label = {i: label for i, label in enumerate(labels)}
    label2id = {label: i for i, label in enumerate(labels)}

    # Specify the name of the pretrained BERT model to use.
    model_name = 'bert-large-cased'

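    # Create the config object.
    config = BertConfig.from_pretrained(
        # Always specify the pretrained model to use.
        pretrained_model_name_or_path=model_name,
        # Also pass the following, because the goal here is NER with 13 classes.
        id2label=id2label,
        label2id=label2id,  # It turned out that this one does not actually need to be passed.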
    )
    print(config)

    # Create the token classification model from the pretrained model name and the config object.
    model = BertForTokenClassification.from_pretrained(
        pretrained_model_name_or_path=model_name,
        config=config
    )
    print(model)  # Long when printed because there are 24 layers.


if __name__ == '__main__':
    main()

2020-09-14

Notes: just running examples/token-classification from transformers

Miscellaneous notes

This just runs the named entity recognition example (the one on the WNUT’17 data) from the transformers repository below.
https://github.com/huggingface/transformers/tree/master/examples/token-classification
However, with bert-large-cased as the pretrained model, my machine failed with DefaultCPUAllocator: can't allocate memory, so for now the steps below use bert-base-cased instead.

Steps
Supplementary notes

Steps

Clone the transformers repository and move to the examples/token-classification directory

git clone https://github.com/huggingface/transformers.git
cd transformers/examples/token-classification/

Download the WNUT’17 data

mkdir -p data_wnut_17
curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/wnut17train.conll'  | tr '\t' ' ' > data_wnut_17/train.txt.tmp
curl -L 'https://github.com/leondz/emerging_entities_17/raw/master/emerging.dev.conll' | tr '\t' ' ' > data_wnut_17/dev.txt.tmp
curl -L 'https://raw.githubusercontent.com/leondz/emerging_entities_17/master/emerging.test.annotated' | tr '\t' ' ' > data_wnut_17/test.txt.tmp

ls data_wnut_17/  # Check that (train|dev|test).txt.tmp were downloaded

As preprocessing, split sentences longer than 128 tokens

export MAX_LENGTH=128
export BERT_MODEL=bert-base-cased  # The tokenizer must be specified because tokens are counted
python scripts/preprocess.py data_wnut_17/train.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/train.txt
python scripts/preprocess.py data_wnut_17/dev.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/dev.txt
python scripts/preprocess.py data_wnut_17/test.txt.tmp $BERT_MODEL $MAX_LENGTH > data_wnut_17/test.txt

ls data_wnut_17/  # Check that the preprocessed (train|dev|test).txt were generated
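
To make the idea of this step concrete, here is a minimal sketch of splitting by a running subword count. It is not the contents of scripts/preprocess.py, which may differ in details. The function name split_long_sentences and the toy counter toy_count are my own assumptions; the real step counts subwords with the BERT tokenizer.

```python
def split_long_sentences(lines, count_subwords, max_len):
    """Insert a blank line (sentence break) whenever the running subword
    count of the current sentence would exceed max_len.

    lines: CoNLL-style lines ("token label"); a blank line ends a sentence.
    count_subwords: function mapping a token to its number of subwords.
    """
    out = []
    running = 0
    for line in lines:
        line = line.rstrip()
        if not line:               # existing sentence boundary
            out.append("")
            running = 0
            continue
        n = count_subwords(line.split()[0])
        if running + n > max_len:  # start a new sentence before this token
            out.append("")
            running = 0
        running += n
        out.append(line)
    return out


def toy_count(token):
    """Hypothetical counter: one subword per 4 characters (NOT the BERT tokenizer)."""
    return (len(token) + 3) // 4


sample = [
    "Empire B-location",
    "State I-location",
    "Building I-location",
    "",
    "ESB B-location",
]
result = split_long_sentences(sample, toy_count, max_len=4)
print("\n".join(result))
```

Note that a split like this can fall in the middle of an entity, which is the price of keeping every sentence within the length limit.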

Collect the labels

cat data_wnut_17/train.txt data_wnut_17/dev.txt data_wnut_17/test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > data_wnut_17/labels.txt

vi data_wnut_17/labels.txt  # Check that a list of 13 labels was generated: 6 entity types × (B+I) + O
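
For reference, a rough Python counterpart of the cut / grep / sort / uniq pipeline above, together with the arithmetic behind the 13 labels. The function name collect_labels and the sample lines are my own. Python sorts by code point, which happens to give the same order as the label list here.

```python
def collect_labels(lines):
    """Collect the distinct values of the second space-separated field,
    skipping blank lines, and return them sorted."""
    found = set()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        if len(fields) >= 2 and fields[1]:
            found.add(fields[1])
    return sorted(found)


# Made-up sample lines in the "token label" format.
sample = ["Empire B-location", "State I-location", "", "view O", "ESB B-location"]
print(collect_labels(sample))

# 6 entity types x (B + I) + O = 13 labels.
entity_types = ["corporation", "creative-work", "group", "location", "person", "product"]
expected = sorted([f"{prefix}-{t}" for prefix in "BI" for t in entity_types] + ["O"])
print(len(expected))
print(expected)
```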

Prepare the configuration file

vi wnut_17.json  # Just paste the following

{
    "data_dir": "./data_wnut_17",
    "labels": "./data_wnut_17/labels.txt",
    "model_name_or_path": "bert-base-cased",
    "output_dir": "wnut-17-model-1",
    "max_seq_length": 128,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 32,
    "save_steps": 425,
    "seed": 1,
    "do_train": true,
    "do_eval": true,
    "do_predict": true,
    "fp16": false
}

Train the model

python run_ner.py wnut_17.json
# Training runs

vi wnut-17-model-1/test_predictions.txt  # Check that the predictions for the test data were written

Supplementary notes

About WNUT’17

Among named entity recognition tasks, this one is said to be a task that includes rare entities.
WNUT’17 Emerging and Rare Entities task | noisy-text.github.io
Six types of named entities are labeled.

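Person	Names of people (fictional people are allowed).
Location	Names of places (fictional places are allowed).
Corporation	Names of companies (if the name refers to a place or a product, use Location or Product instead).
Product	Names of products (fictional products are allowed).
Creative work	Names of creative works (movies, books, songs, and so on).
Group	Names of groups (sports teams, bands, and so on). Company names go under Corporation.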

In the very first sentence of train.txt.tmp, the place names ("Empire State Building" and "ESB") are labeled as locations:

@paulwalk It's the view from where I'm living for two weeks. Empire State Building = ESB. Pretty bad storm here last evening.
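
As a small illustration of how B-/I-/O labels mark entities, here is a sketch that groups labeled tokens back into entity spans. The tokens and labels below are my reading of the excerpt above (Empire State Building and ESB as locations), so treat them as an assumption rather than a copy of the data file. extract_entities is a helper name of my own.

```python
def extract_entities(tokens, labels):
    """Group B-/I- labelled tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new span; close the open one first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["Empire", "State", "Building", "=", "ESB", "."]
labels = ["B-location", "I-location", "I-location", "O", "B-location", "O"]
print(extract_entities(tokens, labels))
```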

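About the configuration file

The training-related keys are the arguments defined in the following file (keys such as data_dir, labels, and model_name_or_path appear to be defined in run_ner.py itself).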
transformers/training_args.py at master · huggingface/transformers · GitHub

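About the model

Each token is classified by a BertForTokenClassification model built on the pretrained bert-base-cased.
https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_bert.py#L1447-L1530
    Input sentence (sequence of token IDs)
    → [BERT]
    → Sequence of 768-dimensional feature vectors (* 1024-dimensional for bert-large-cased)
    → [Dropout]
    → [Fully connected]
    → Sequence of 13-dimensional output vectors (* before activation, i.e. logits)
The loss is torch.nn.CrossEntropyLoss against the correct labels.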
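
Because the 13-dimensional outputs are pre-activation scores, the softmax is applied inside the loss. A minimal pure-Python sketch of the per-token cross entropy is below. It only illustrates the formula and is not the torch implementation, which by default also averages over tokens. The function name cross_entropy and the example scores are my own.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one token."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# All 13 scores equal: every class has probability 1/13, so the loss is log(13).
uniform = cross_entropy([0.0] * 13, target=12)
print(uniform, math.log(13))

# A large score on the correct class gives a loss close to zero.
confident = cross_entropy([10.0 if i == 4 else 0.0 for i in range(13)], target=4)
print(confident)
```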

About training

Training uses AdamW.
https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L583
→ https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L412
→ https://github.com/huggingface/transformers/blob/master/src/transformers/optimization.py#L219
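
For intuition, a minimal sketch of one AdamW update on a single scalar parameter: an Adam step plus weight decay that is applied to the parameter directly instead of being mixed into the gradient. This follows the general form of the algorithm and is not a copy of the code linked above; actual implementations differ in details such as the order of the two terms. The function name adamw_step and the hyperparameter values are my own choices.

```python
import math


def adamw_step(p, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter p at step t (t starts at 1).

    Returns the updated parameter and the updated moment estimates.
    """
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    adam_update = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled weight decay: proportional to p itself, not part of the gradient.
    p = p - lr * (adam_update + weight_decay * p)
    return p, m, v


# First step without weight decay: the step size is about lr regardless of the gradient scale.
p1, m1, v1 = adamw_step(1.0, grad=2.0, m=0.0, v=0.0, t=1, lr=0.1, weight_decay=0.0)
print(p1)

# Zero gradient with weight decay: the parameter still shrinks toward zero.
p2, m2, v2 = adamw_step(1.0, grad=0.0, m=0.0, v=0.0, t=1, lr=0.1, weight_decay=0.5)
print(p2)
```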

2020-09-13