I'm currently working on a project involving sequence multilabel classification. Since my dataset is highly technical, I thought that doing additional pretraining on BERT before fine-tuning it for the classification part would be beneficial. But I can't find any guide on using Hugging Face transformers and Keras together to pre-train the model. My idea is to pre-train the model on my dataset, then save it and load it again to fine-tune the classifier. Every example I found is meant for PyTorch, but I have to use TensorFlow. I have written this code so far:
from transformers import TFDistilBertForMaskedLM, AutoTokenizer, AutoConfig
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFDistilBertForMaskedLM.from_pretrained("distilbert-base-cased")
model.compile(optimizer="adam")
data = tokenizer(
    twenty_train.data[:10],
    return_tensors="tf",
    padding=True,
    truncation=True,
    max_length=tokenizer.model_max_length,
)
Where do I go from here to fit my data to the model? I know I should also provide the model masked inputs (and labels for the masked positions), but I can't understand where or how to do that.
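My rough guess for the next step is below. This is only a sketch of what I think it might look like, assuming DataCollatorForLanguageModeling can return TensorFlow tensors (I believe return_tensors="tf" only exists in fairly recent transformers versions) and that compiling without a loss makes the model fall back to its internal masked-LM loss. The save directory and epochs=3 are just placeholders. Is this the right direction?

import tensorflow as tf
from transformers import DataCollatorForLanguageModeling

# The collator randomly masks ~15% of the tokens and builds the labels:
# masked positions keep their original ids, everything else becomes -100
# (which, as far as I understand, the model's MLM loss ignores).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    return_tensors="tf",  # assumption: only available in recent transformers versions
)

# The collator expects a list of per-example feature dicts,
# so split the batched encoding back into single examples.
features = [
    {key: val[i].numpy() for key, val in data.items()}
    for i in range(len(data["input_ids"]))
]
batch = collator(features)

# Separate the MLM labels and wrap everything in a tf.data.Dataset for Keras.
labels = batch.pop("labels")
train_ds = tf.data.Dataset.from_tensor_slices((dict(batch), labels)).batch(8)

# compile() was already called without a loss above, so (as far as I understand)
# the model should use its internal masked-LM loss during fit().
model.fit(train_ds, epochs=3)

# Save the further-pretrained weights and the tokenizer; the plan is to reload
# them afterwards with TFAutoModelForSequenceClassification.from_pretrained(...)
# for the multilabel fine-tuning step. The path is just a placeholder.
model.save_pretrained("./distilbert-mlm-pretrained")
tokenizer.save_pretrained("./distilbert-mlm-pretrained")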