
Create test cases for the Tokenizer #27

Open
luke-carlson opened this issue Nov 1, 2024 · 4 comments
Labels
good first issue Good for newcomers

Comments

@luke-carlson
Collaborator

We should create a test case that loads a file, creates a tokenizer (https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py), and then asserts that the tokenizer was created.
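A minimal sketch of what such a test could look like. This is not the actual ml-mdm API: `read_vocab` is a hypothetical stand-in for the repo's dictionary-reading functions, and it assumes a SentencePiece-style `.vocab` layout (one tab-separated `token<TAB>score` pair per line), which may not match the real file format.

```python
# Hypothetical sketch, NOT the real ml-mdm API. Assumes a
# SentencePiece-style .vocab file: "token<TAB>score" per line.
from pathlib import Path


def read_vocab(token_file: str) -> dict[str, int]:
    """Stand-in for a dictionary reader: map each token to its line index."""
    vocab = {}
    for idx, line in enumerate(Path(token_file).read_text().splitlines()):
        token = line.split("\t")[0]
        vocab[token] = idx
    return vocab


def test_tokenizer_is_created(tmp_path):
    # Write a tiny vocab file so the test is fully self-contained.
    vocab_file = tmp_path / "tiny.vocab"
    vocab_file.write_text("<pad>\t0\n<s>\t0\nhello\t-3.2\n")

    vocab = read_vocab(str(vocab_file))

    # The assertion the issue asks for: the tokenizer/vocab was created.
    assert vocab is not None
    assert len(vocab) == 3
    assert vocab["hello"] == 2
```

With pytest, `tmp_path` is provided automatically as a per-test temporary directory, so the test never touches checked-in data files.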

@luke-carlson luke-carlson added the good first issue Good for newcomers label Nov 1, 2024
@aaliyahnl
Contributor

aaliyahnl commented Nov 2, 2024

Working on this with @pedroborgescruz.

@pedroborgescruz
Contributor

pedroborgescruz commented Nov 7, 2024

Hi, @luke-carlson! We started working on this issue today. We've made progress, but we'd like to ask you a few questions before continuing:

(1) What is the difference between a .vocab file (the parameter defined in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/factory.py#124) and a token_file (the parameter of the functions in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py#45)?
(2) We noticed there are different "modes" for the tokenizer functions in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py. Would the function read_dictionary(token_file) in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py#78 expect to receive any .vocab file that is neither BERT nor T5, e.g. imagenet.vocab?
(3) We have started writing our tests for the tokenizer file, but we are unsure how to structure the assertions. We thought of two different cases and were wondering if you could shed some light. Which case below, if any, do you think would be the appropriate way to assert that the tokenizer was created?

  1. CASE 1: We compare factory.py's create_tokenizer() to tokenizer.py's output. Ex.: assert create_tokenizer() == read_dictionary_bert().
    Here, we assume token_file and vocab_file refer to the same file, so we can compare the two outputs for the final assert.

  2. CASE 2: We run factory.py's create_tokenizer(), then pass its output as the token_file to tokenizer.py's functions, and assert against something else. Ex.: assert read_dictionary_bert(create_tokenizer(.vocab)) == ?
    Here, we assume token_file and vocab_file do not refer to the same file, and we are not sure what to compare against for the final assert.

@luke-carlson
Collaborator Author

Hey Pedro,

  1. Looks like we ended up using two different names to refer to the same thing; we should probably rename them so both use vocab_file as the variable, matching the CLI.
  2. Just those vocab options for now.
  3. The first case is a reasonable one to write; that will already increase code coverage of the tokenizer.py file.
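A self-contained sketch of CASE 1 as endorsed above. The two functions are hypothetical stand-ins, not the real create_tokenizer and read_dictionary_bert (whose signatures this thread does not fully specify); the point is the shape of the assertion: when token_file and vocab_file are the same file, the two loaders should produce equal vocabularies.

```python
# Stand-ins, NOT the real ml-mdm functions. Both parse the same
# tab-separated vocab file into a token -> index mapping, so their
# outputs can be compared directly (CASE 1 from the discussion).
from pathlib import Path


def create_tokenizer_stub(vocab_file: str) -> dict[str, int]:
    # Stand-in for factory.py's create_tokenizer().
    lines = Path(vocab_file).read_text().splitlines()
    return {line.split("\t")[0]: i for i, line in enumerate(lines)}


def read_dictionary_bert_stub(token_file: str) -> dict[str, int]:
    # Stand-in for tokenizer.py's read_dictionary_bert(); should agree
    # with the factory when token_file and vocab_file are the same file.
    lines = Path(token_file).read_text().splitlines()
    return {line.split("\t")[0]: i for i, line in enumerate(lines)}


def test_case_1(tmp_path):
    vocab_file = tmp_path / "bert.vocab"
    vocab_file.write_text("[PAD]\t0\n[CLS]\t0\nhello\t-1.5\n")

    # The CASE 1 assertion: both loaders, fed the same file, agree.
    assert create_tokenizer_stub(str(vocab_file)) == read_dictionary_bert_stub(
        str(vocab_file)
    )
```

Swapping the stubs for the real imports from ml_mdm.language_models would turn this into the coverage-increasing test described in point 3.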

@pedroborgescruz
Contributor

PR: #41