Use pre-trained BERT weights with TorchSharp-defined BERT for NLP tasks #457
Replies: 9 comments 7 replies
-
@fwaris Please look at these transformer diagrams, created from GPT-2.ONNX using a .NET WinForms application. What you have done is exciting. The community needs a way to reverse engineer the Hugging Face models (TensorFlow or ONNX) into TorchSharp models. We need your feedback on the following questions:
This is a quick write-up. Essentially, we need UI-assisted reverse engineering of Hugging Face models into TorchSharp (model and weight transfer), to assist and empower less experienced users to contribute to building up the .NET Hugging Face models. If possible, we learn best practices from Hugging Face. The .NET tokenizers can be provided by BlingFire. We will have a repository of tokenizers corresponding to what is available from Hugging Face. The WinForms application (for example) would download the right tokenizers and provide code generation for encoding and decoding. It does not have to be WinForms; it could also be a .NET library that generates the needed diagrams for .NET Interactive or another UI (web: Blazor, or WPF/UWP). What you have done brings us closer to that vision. It is about time that we pool the community's effort into a .NET (community-driven) Hugging Face solution.
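As a sketch of what a shared tokenizer repository would provide, here is a toy greedy longest-match (WordPiece-style) subword tokenizer. It is written in dependency-free Python purely to illustrate the algorithm that libraries like BlingFire implement natively; the `##` continuation convention follows BERT's tokenizer, and the vocabulary below is made up for the example - this is not any specific BlingFire API.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, BERT WordPiece style.

    Non-initial pieces carry a '##' prefix. If no prefix of the
    remaining text matches the vocabulary, the whole word maps to unk.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation-piece convention
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate and retry
        if match is None:
            return [unk]
        tokens.append(match)
        start = end
    return tokens
```

For example, with a vocabulary containing `un`, `##aff`, and `##able`, the word `unaffable` splits into those three pieces; a word with no matching prefix collapses to `[UNK]`.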
-
@mfagerlund I know your hands are full NOW. However, if you have time to spare at the end of the year or the beginning of next year, do put a bit of your brain power into this (VERY COMPLEX, but requiring genius design considerations) overdue .NET deep-NLP vision :-)
-
I was assigned a new project today; it is about the research and application of NLP 😥
-
The basic transformer module exists in TorchSharp today; however, NLP with transformers is a large subject area these days, what with ChatGPT, GPT-3, GPT-4, etc. It's hard for individuals (or even small companies) to train models like ChatGPT due to the infrastructure required. ChatGPT further uses reinforcement learning with human supervision and, as far as I know, that data is not public. If you want state-of-the-art results, I suggest that you use the APIs provided by OpenAI or Microsoft. If you want to train a generative language model from scratch in .NET, see this code repo. It has the code for defining a GPT-3 style model and training it with a small corpus.
-
@fwaris has done much for GPT. You need to bring that from F# to C#.
-
You already know deep reinforcement learning; try to understand https://github.com/fwaris/LangModel and port it from F# to C#, if you have time.
-
@ChengYen-Tang
-
I do not have enough time to examine how to port the F# code in https://github.com/fwaris/LangModel to C#. Since we do not YET have a TorchText for .NET, I wonder how to re-purpose the many projects you have written for NLP/NLG in F# to C#, with specific interest in initiating a TorchText for .NET. See: "Best approach for designing F# libraries for use from both F# and C#"; "F# - C# Interop".
-
The core models are written in TorchSharp.Fun - a thin functional wrapper over TorchSharp. TorchSharp.Fun models are easily convertible to C#. There are two basic cases:

F# Case 1:

```fsharp
Linear(100, 100) ->> Dropout(0.1) ->> ReLU()
```

Equivalent C# Case 1:

```csharp
var model = Sequential(
    ("lin1", Linear(100, 100)),
    ("drp1", Dropout(0.1)),
    ("rel1", ReLU()));
```

F# Case 2:

```fsharp
let lin1 = Linear(100, 100)
let drp1 = Dropout(0.1)
let rel1 = ReLU()
let myModel = F [] [lin1; drp1; rel1] (fun t -> t --> lin1 --> drp1 --> rel1)
```

Equivalent C# Case 2:

```csharp
public class MyModel : Module<Tensor, Tensor> {
    private Module<Tensor, Tensor> lin1 = Linear(100, 100);
    private Module<Tensor, Tensor> drp1 = Dropout(0.1);
    private Module<Tensor, Tensor> rel1 = ReLU();

    public MyModel(string name) : base(name) {
        RegisterComponents();
    }

    public override Tensor forward(Tensor input) {
        var l1 = this.lin1.forward(input);
        var d1 = this.drp1.forward(l1);
        return this.rel1.forward(d1);
    }
}
```

**Opinion**

Data science work requires formulating scores of small hypotheses (re model/parameters/features) and associated experiments to prove/disprove each. This is iterative work that is best done interactively. I use F# Interactive (REPL) for model development. TorchSharp.Fun was created to allow quick iteration on model structure in F# script code - it reduces boilerplate. Also, today, data scientists should be polyglots. I routinely use Python and Scala (Spark) at work, in addition to F#. If you are doing data science in .NET, then it's worthwhile learning F#.
-
The goal was to:
It was accomplished. The notebook is available in this repo. Output from a sample run is here: https://github.com/fwaris/BertTorchSharp/blob/master/saved_output.ipynb.
There were several hurdles to overcome:
To my surprise, the easiest to overcome was defining the BERT model. Given that PyTorch/TorchSharp support transformers directly, I was able to construct BERT easily - the code is less than 60 lines and quite readable.
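For intuition on why the model definition stays so small when the framework supplies the transformer building blocks, here is scaled dot-product attention - the core operation inside each transformer layer - sketched in dependency-free Python. This is a conceptual illustration only, not the TorchSharp code from the notebook; in TorchSharp it is handled by the built-in transformer/attention modules.

```python
import math

def softmax(xs):
    # numerically stable softmax over one row of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(a, b):
    # (n, k) x (k, m) -> (n, m), lists-of-lists
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V - one attention head, no masking."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]  # rows sum to 1
    return matmul(weights, v)
```

Each output row is a convex combination of the value rows, with mixing weights determined by query-key similarity; stacking this with feed-forward layers and residual connections is essentially what the framework's transformer module packages up.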
The hardest (for me) was reading the checkpoints, which requires understanding the LevelDB file format. This functionality is now available as a NuGet package: TfCheckpoint.
The mapping of weights from the checkpoint to TorchSharp was tricky but not difficult. There are a few googlies to watch out for.
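One such googly - a general TensorFlow-vs-PyTorch convention difference, not something specific to this repo: TensorFlow stores a Dense layer's kernel with shape (in_features, out_features), while a PyTorch/TorchSharp Linear weight has shape (out_features, in_features), so dense kernels must be transposed when copied across. A minimal Python illustration of that transpose (the helper name is mine):

```python
def tf_dense_kernel_to_torch_linear(kernel):
    """Transpose a TF Dense kernel, shape (in, out), into the
    (out, in) layout that a PyTorch/TorchSharp Linear weight uses."""
    return [list(col) for col in zip(*kernel)]
```

Copying such a tensor without the transpose either fails on a shape mismatch or, worse for square layers, silently scrambles the weights - which is why this class of mapping bug is worth checking tensor by tensor.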
It's expensive and time-consuming to train BERT (or other language models). The ability to use pre-trained weights makes NLP tasks much easier to perform in TorchSharp.