I'm currently working on a project involving sequence multilabel classification. Since my dataset is highly technical, I thought that doing additional pretraining on BERT before fine-tuning it for the classification part would be beneficial. But I can't find any guide on using Hugging Face transformers and Keras together to pre-train the model. My idea is to pre-train the model on my dataset, then save it and load it again to fine-tune the classifier. Every example I found is meant for PyTorch, but I have to use TensorFlow. I have written this code so far:
from transformers import TFDistilBertForMaskedLM, AutoTokenizer, AutoConfig
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=42)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFDistilBertForMaskedLM.from_pretrained("distilbert-base-cased")
model.compile(optimizer="adam")
data = tokenizer(
    twenty_train.data[:10],
    return_tensors="tf",
    padding=True,
    truncation=True,
    max_length=tokenizer.model_max_length,
)
Where do I go from here to fit my data to the model? I know I should also provide the model masked inputs (and labels for the masked positions), but I can't understand where or how to do that.
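My rough guess for the next step is below. This is only a sketch of what I think it might look like, assuming DataCollatorForLanguageModeling can return TensorFlow tensors (I believe return_tensors="tf" only exists in fairly recent transformers versions) and that compiling without a loss makes the model fall back to its internal masked-LM loss. The save directory and epochs=3 are just placeholders. Is this the right direction?

import tensorflow as tf
from transformers import DataCollatorForLanguageModeling

# The collator randomly masks ~15% of the tokens and builds the labels:
# masked positions keep their original ids, everything else becomes -100
# (which, as far as I understand, the model's MLM loss ignores).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    return_tensors="tf",  # assumption: only available in recent transformers versions
)

# The collator expects a list of per-example feature dicts,
# so split the batched encoding back into single examples.
features = [
    {key: val[i].numpy() for key, val in data.items()}
    for i in range(len(data["input_ids"]))
]
batch = collator(features)

# Separate the MLM labels and wrap everything in a tf.data.Dataset for Keras.
labels = batch.pop("labels")
train_ds = tf.data.Dataset.from_tensor_slices((dict(batch), labels)).batch(8)

# compile() was already called without a loss above, so (as far as I understand)
# the model should use its internal masked-LM loss during fit().
model.fit(train_ds, epochs=3)

# Save the further-pretrained weights and the tokenizer; the plan is to reload
# them afterwards with TFAutoModelForSequenceClassification.from_pretrained(...)
# for the multilabel fine-tuning step. The path is just a placeholder.
model.save_pretrained("./distilbert-mlm-pretrained")
tokenizer.save_pretrained("./distilbert-mlm-pretrained")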