
[QUESTION] Protocol of adding a new model (Stella_en_<*>_v5 family) implementation with Candle #2525

Closed
AnubhabB opened this issue Oct 1, 2024 · 10 comments · Fixed by #2551


@AnubhabB
Contributor

AnubhabB commented Oct 1, 2024

Hi,

I have a working implementation of the Stella_en_<*>_v5 family of models, which is one of the top-ranking models on the MTEB leaderboard for reranking and retrieval.

It's basically built on top of the candle-transformers::qwen2 implementation with the language modeling head swapped for their pre-trained dense layer.
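
For reference, a minimal sketch of what that head swap can look like with candle_core / candle_nn (the struct name, weight prefix, pooling choice, and shapes are illustrative assumptions, not the exact code I'm proposing):

```rust
use candle_core::{DType, Result, Tensor, D};
use candle_nn::{linear, Linear, Module, VarBuilder};

// Illustrative embedding head: the qwen2 backbone's last hidden states go through
// masked mean pooling and a pre-trained dense projection instead of the LM head.
struct EmbeddingHead {
    dense: Linear,
}

impl EmbeddingHead {
    fn new(hidden_size: usize, embed_dim: usize, vb: VarBuilder) -> Result<Self> {
        // The weight prefix "dense" is a placeholder; the real checkpoint key may differ.
        let dense = linear(hidden_size, embed_dim, vb.pp("dense"))?;
        Ok(Self { dense })
    }

    /// `hidden`: (batch, seq_len, hidden_size) from the backbone's forward pass.
    /// `mask`: (batch, seq_len) attention mask, 1 for real tokens, 0 for padding.
    fn forward(&self, hidden: &Tensor, mask: &Tensor) -> Result<Tensor> {
        // Masked mean pooling over the sequence dimension.
        let mask = mask.to_dtype(DType::F32)?.unsqueeze(D::Minus1)?;
        let summed = hidden.broadcast_mul(&mask)?.sum(1)?;
        let counts = mask.sum(1)?;
        let pooled = summed.broadcast_div(&counts)?;
        // Pre-trained dense projection that replaces the language modeling head.
        self.dense.forward(&pooled)
    }
}
```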

I was hoping to open a pull request with candle along with an example.

Questions:

  1. Would it be OK?
  2. If it is, is there a protocol to follow for candle_transformers? Something akin to "How to add a model to transformers"?
  3. For embedding models, are there any standard Candle APIs that need to be implemented? My implementation just spits out the logits from the forward pass.

Looking for some guidance. Thanks in advance.

CC:
@EricLBuehler @LaurentMazare

@LaurentMazare
Collaborator

Adding popular models such as these is certainly fine (as long as the license permits it; e.g. putting yolo-v8 or llama 3.2 vision directly in candle-transformers would be problematic).
There isn't a very formal process for this; the best approach is probably to start by looking at similar models and how they are implemented, and to do it in the same style. If you go down this path and want to formalize it in a "How to add a model" document, the contribution would be welcome.
There is no standard API provided in candle-transformers; just spitting out the logits seems fine as long as there is an example showing how to use them.

On best practices:

  • Ideally one should follow the Python reference implementation if it exists and have links in the code to the different parts of this reference implementation.
  • The code should be validated on some examples against the reference implementation, and the logits should not be far apart (usually better to use f32 than bf16 for this); a rough sketch of such a check follows this list.
  • It's good to have an example with a README that links to the model card and gives some input/output samples.
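
To illustrate the second point, one way to run the check from the Rust side is to export the reference logits from the Python run (for instance into a safetensors file) and diff them against candle's forward pass; the file name, tensor key, and tolerance below are assumptions for the sketch:

```rust
use candle_core::{DType, Device, Result, Tensor};

// Hypothetical helper: compare candle's logits against logits exported from the
// Python reference implementation (assumed saved as `reference_logits.safetensors`
// under the key "logits").
fn check_against_reference(candle_logits: &Tensor, device: &Device) -> Result<()> {
    let reference = candle_core::safetensors::load("reference_logits.safetensors", device)?;
    let reference = &reference["logits"];
    // Compare in f32; bf16 rounding makes the gap look larger than it really is.
    let diff = (candle_logits.to_dtype(DType::F32)? - reference.to_dtype(DType::F32)?)?
        .abs()?
        .flatten_all()?
        .max(0)?
        .to_scalar::<f32>()?;
    println!("max abs diff vs reference: {diff}");
    assert!(diff < 1e-3, "logits diverge from the reference implementation");
    Ok(())
}
```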

@AnubhabB
Contributor Author

AnubhabB commented Oct 1, 2024

Thanks a ton for the elaborate response @LaurentMazare, I'll take a stab at it after verifying the license.

@iskng
Contributor

iskng commented Nov 8, 2024

Am I missing something? The smaller stella_400M docs say it's based off the same model, but the config lists different activation functions, and swapping the model address produces:
Error: Tokenizer doesn't contain expected `<|endoftext|>` token

@AnubhabB
Contributor Author

AnubhabB commented Nov 8, 2024

I don't think you are missing anything .. I was stumped there too. The 1.5B and 400M variants differ by more than just config changes. I've not had the chance to work on the 400M variant for candle-transformers yet; only the 1.5B variant is supported for now.

@iskng
Contributor

iskng commented Nov 8, 2024

Ah thanks for the heads up!

If you don't mind, what was your process for the 1.5B?
Did you start by converting modeling_qwen.py and then just implement the task-specific queries and the tokenizer?

The 400M has a modeling.py that I've got converted and outputting embeddings, but they don't match the sample and are way off.

@AnubhabB
Contributor Author

AnubhabB commented Nov 8, 2024

Yes, that's what I did .. picked up the base Qwen model Candle implementation, removed everything, and kept adding one module at a time while validating the output against the HF Transformers implementation.

Just noting down a few misses I had during the phase you are in that resulted in output mismatches:

  1. The tokenizer padding - it pads left, not right the way most embedding models do; I had missed this and ended up with similar but not identical outputs (see the padding sketch after this list).
  2. Check if you can reproduce the results reported by the authors - with AutoModel I missed trust_remote_code=True, which gave me a lot of headache before I figured it out.
  3. The file you should look at is modeling.py - with trust_remote_code=True, that is the file that actually gets executed.
  4. float32 seemed to be important for stability in the final embedding projection layer - with bf16 I had NaNs IIRC.
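
To make point 1 concrete on the candle side, here's a small sketch of forcing left padding with the tokenizers crate; the pad id/token are the usual Qwen2 `<|endoftext|>` values, but treat them as assumptions and read the real ones from tokenizer_config.json:

```rust
use anyhow::Error as E;
use tokenizers::{PaddingDirection, PaddingParams, Tokenizer};

// Left-padding setup for the 1.5B Stella tokenizer; most embedding models pad
// on the right, so this is easy to miss. The pad id/token below are assumptions
// taken from the Qwen2 vocab.
fn load_left_padded_tokenizer(path: &str) -> anyhow::Result<Tokenizer> {
    let mut tokenizer = Tokenizer::from_file(path).map_err(E::msg)?;
    tokenizer.with_padding(Some(PaddingParams {
        direction: PaddingDirection::Left,
        pad_id: 151643,
        pad_token: "<|endoftext|>".to_string(),
        ..Default::default()
    }));
    Ok(tokenizer)
}
```

And for point 4, the fix for me was to run that final projection in f32 (e.g. cast the pooled hidden states with `.to_dtype(DType::F32)?` before the layer).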

Some of the above may be obvious .. these are just mistakes I made.

By the way .. a PR with the 400M would be welcome. Would very much love to hear about the progress you make.

@iskng
Contributor

iskng commented Nov 8, 2024

The tokenizer padding_side comes from tokenization_qwen.py, right? But the 400M doesn't use that as far as I can tell.
AutoTokenizer has trust_remote_code=True, but there's no specific .py file for the 400M.

Seems like it's based more on gte-large-en-v1.5.

@AnubhabB
Contributor Author

AnubhabB commented Nov 8, 2024

The tokenizer config actually comes from the tokenizer_config, and yes, they are using padding: right here, unlike the 1.5B variant; it also looks like they are using BertTokenizer.

Most importantly, they define custom AutoModel and AutoConfig mappings here, which reference modeling.py and config.py.

To implement this in candle we'll need to walk through and re-write it, starting with modeling.py -> class NewModel and config.py -> class NewConfig.
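
If it helps, a hypothetical starting point for that rewrite is mirroring NewConfig as a serde struct so config.json deserializes directly; the field names below are guesses at the usual transformer fields, not the actual contents of config.py:

```rust
use serde::Deserialize;

// Hypothetical candle-side mirror of config.py's NewConfig. Field names are
// assumptions; take the real set (and any defaults) from config.py / config.json.
#[derive(Debug, Clone, Deserialize)]
pub struct Config {
    pub vocab_size: usize,
    pub hidden_size: usize,
    pub num_hidden_layers: usize,
    pub num_attention_heads: usize,
    pub intermediate_size: usize,
    pub max_position_embeddings: usize,
    pub layer_norm_eps: f64,
    pub hidden_act: candle_nn::Activation,
}

fn load_config(path: &str) -> anyhow::Result<Config> {
    // Any field missing from config.json will surface as a deserialization error;
    // unknown fields in the JSON are ignored by default.
    Ok(serde_json::from_slice(&std::fs::read(path)?)?)
}
```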

Check out the linked tokenizer_config; that should give you relevant pointers.

By the way, for stella_en_400M_v5 I don't see a tokenization_qwen.py. Use this as your reference implementation instead of qwen.

@iskng
Contributor

iskng commented Nov 9, 2024

Yeah, I'd got which repo; I just wasn't clear on whether there was an existing architecture reference in candle.

Ended up just starting from scratch; seems like it needs this custom rope implementation as well.

I can get this output, but I'm not clear on how to hook it in with the MLR layer.
For the moment I've just copied your mean pooling and stuck it on the end.
Similarity scores: [[0.8398, 0.2990],
[0.3282, 0.8095]]
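
In case it helps with matching the repo's sample, here is roughly how such a score matrix is usually computed (L2-normalize the embeddings, then take pairwise dot products); a generic sketch, not necessarily the exact code behind the numbers above:

```rust
use candle_core::{Result, Tensor};

// Cosine-similarity matrix between two batches of embeddings:
// queries (n, d) x docs (m, d) -> scores (n, m).
fn similarity_scores(queries: &Tensor, docs: &Tensor) -> Result<Tensor> {
    let normalize = |t: &Tensor| -> Result<Tensor> {
        let norm = t.sqr()?.sum_keepdim(1)?.sqrt()?;
        t.broadcast_div(&norm)
    };
    normalize(queries)?.matmul(&normalize(docs)?.t()?)
}
```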

@iskng
Contributor

iskng commented Nov 9, 2024

Below is what I've got so far; it can reproduce the repo's example and follows your 1.5B example.
Still a little unclear on what's needed to submit for the model. It doesn't include any of the training-related dropouts, and the config is hardcoded in the model. It's more of an inference-only port for these specific weights.
Here's the PR
