
I'm currently working on a sequence multilabel classification project. Since my dataset is highly technical, I thought that doing additional pretraining on BERT before fine-tuning it for classification would be beneficial. But I can't find any guide on using Hugging Face transformers and Keras together to pre-train the model. My idea is to pre-train the model on my dataset, then save it and load it again to fine-tune the classifier. Every code sample I found is meant for PyTorch, but I have to use TensorFlow. I have written this code so far:

from transformers import TFDistilBertForMaskedLM, AutoTokenizer
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFDistilBertForMaskedLM.from_pretrained("distilbert-base-cased")

model.compile(optimizer="adam")  # no loss given; the model's internal MLM loss is used when labels are passed
# Tokenize a small sample; this produces input_ids/attention_mask but no masked labels yet
data = tokenizer(
    twenty_train.data[:10], 
    return_tensors="tf", 
    padding=True, 
    truncation=True, 
    max_length=tokenizer.model_max_length
)

Where do I go from here to fit my data to BERT? I know I should also provide the model with masked inputs, but I can't figure out where or how to do that.

1 Answer

You can pre-train the BERT model on your custom dataset.

Sample working code

import tensorflow as tf
import tensorflow_hub as hub

# Preprocessing layer: turns raw strings into BERT's expected input format
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
# Encoder layer: the BERT model itself; trainable=True so its weights can be fine-tuned
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=True)

# Get one fixed-size embedding per input sentence
def get_sentence_embedding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']

get_sentence_embedding([
    "How to find which version of TensorFlow is",
    "TensorFlow not found using pip"
])
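
For reference, 'pooled_output' is one fixed-size vector per input sentence (768-dimensional for this L-12_H-768_A-12 encoder), so the call above returns a tensor of shape (2, 768).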

def build_classifier_model(num_classes):

  class Classifier(tf.keras.Model):
    def __init__(self, num_classes):
      super(Classifier, self).__init__(name="prediction")
      self.encoder = hub.KerasLayer(bert_encoder, trainable=True)
      self.dropout = tf.keras.layers.Dropout(0.1)
      # No activation here: the layer outputs logits; apply sigmoid outside for multilabel
      self.dense = tf.keras.layers.Dense(num_classes)

    def call(self, preprocessed_text):
      encoder_outputs = self.encoder(preprocessed_text)
      pooled_output = encoder_outputs["pooled_output"]
      x = self.dropout(pooled_output)
      x = self.dense(x)
      return x

  model = Classifier(num_classes)
  return model

test_classifier_model = build_classifier_model(2)

# The model expects preprocessed text, so run raw strings through bert_preprocess first
text_preprocessed = bert_preprocess(["TensorFlow not found using pip"])
bert_raw_result = test_classifier_model(text_preprocessed)
print(tf.sigmoid(bert_raw_result))
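
Note that the Dense head above returns logits, so for a multilabel setup you would compile this model with tf.keras.losses.BinaryCrossentropy(from_logits=True) and multi-hot label vectors.

That said, the question specifically asks how to continue masked-language-model pretraining with Hugging Face transformers and TensorFlow before fine-tuning. Below is a minimal sketch of that route. It assumes a reasonably recent transformers version; texts, the output directory, the batch size, the epoch count, the learning rate, and num_labels are placeholder choices:

import tensorflow as tf
from transformers import (
    AutoTokenizer,
    TFDistilBertForMaskedLM,
    TFDistilBertForSequenceClassification,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
mlm_model = TFDistilBertForMaskedLM.from_pretrained("distilbert-base-cased")

texts = twenty_train.data[:10]  # replace with your domain corpus

# Tokenize without padding; the collator pads each batch itself
encodings = tokenizer(texts, truncation=True, max_length=tokenizer.model_max_length)

# The collator produces the masked inputs the question asks about: it masks
# ~15% of tokens and builds a `labels` tensor that is -100 everywhere except
# at the masked positions, which is exactly what the MLM loss needs
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15, return_tensors="tf"
)
features = [{"input_ids": ids} for ids in encodings["input_ids"]]
batch = collator(features)

dataset = tf.data.Dataset.from_tensor_slices(dict(batch)).batch(8)

# With `labels` present in the inputs, the model computes the MLM loss
# internally, so compile() needs no explicit loss
mlm_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5))
mlm_model.fit(dataset, epochs=1)

# Save the pre-trained weights, then reload them into a classification head
mlm_model.save_pretrained("./distilbert-domain")
tokenizer.save_pretrained("./distilbert-domain")

clf = TFDistilBertForSequenceClassification.from_pretrained(
    "./distilbert-domain", num_labels=4
)

The classification head of clf is freshly initialized (transformers will warn about this), so you still fine-tune it on your labeled data; for multilabel targets, compile clf with tf.keras.losses.BinaryCrossentropy(from_logits=True) and multi-hot labels rather than relying on the default single-label loss.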