
Create test cases for the Tokenizer #27

Open
luke-carlson opened this issue Nov 1, 2024 · 4 comments
Labels
good first issue Good for newcomers

Comments

@luke-carlson
Collaborator

We should create a test case that loads a file, creates a tokenizer (https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py), and then asserts that the tokenizer was created.
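A minimal sketch of what such a test could look like. This is not the actual ml-mdm API: `read_vocab` is a hypothetical stand-in for the repo's dictionary-reading functions, and it assumes a SentencePiece-style `.vocab` layout (one tab-separated `token<TAB>score` pair per line), which may not match the real file format.

```python
# Hypothetical sketch, NOT the real ml-mdm API. Assumes a
# SentencePiece-style .vocab file: "token<TAB>score" per line.
from pathlib import Path


def read_vocab(token_file: str) -> dict[str, int]:
    """Stand-in for a dictionary reader: map each token to its line index."""
    vocab = {}
    for idx, line in enumerate(Path(token_file).read_text().splitlines()):
        token = line.split("\t")[0]
        vocab[token] = idx
    return vocab


def test_tokenizer_is_created(tmp_path):
    # Write a tiny vocab file so the test is fully self-contained.
    vocab_file = tmp_path / "tiny.vocab"
    vocab_file.write_text("<pad>\t0\n<s>\t0\nhello\t-3.2\n")

    vocab = read_vocab(str(vocab_file))

    # The assertion the issue asks for: the tokenizer/vocab was created.
    assert vocab is not None
    assert len(vocab) == 3
    assert vocab["hello"] == 2
```

With pytest, `tmp_path` is provided automatically as a per-test temporary directory, so the test never touches checked-in data files.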

@luke-carlson luke-carlson added the good first issue Good for newcomers label Nov 1, 2024
@aaliyahnl
Contributor

aaliyahnl commented Nov 2, 2024

Working on this with @pedroborgescruz.

@pedroborgescruz
Contributor

pedroborgescruz commented Nov 7, 2024

Hi, @luke-carlson! We started working on this issue today. We've made progress, but we'd like to ask you a few questions before continuing:

(1) What is the difference between a .vocab file (the parameter defined in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/factory.py#124) and a token_file (the parameter of the functions in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py#45)?
(2) We noticed there are different "modes" for the tokenizer functions in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py. Would the function read_dictionary(token_file) in https://github.com/apple/ml-mdm/blob/main/ml_mdm/language_models/tokenizer.py#78 expect to receive any .vocab file that is neither BERT nor T5, e.g. imagenet.vocab?
(3) We have started writing our tests for the tokenizer file, but we are unsure how to structure the assertions. We thought of two different cases and were wondering if you could shed some light. Which case below, if any, do you think would be the appropriate way to assert that the tokenizer was created?

  1. CASE 1: We compare factory.py's create_tokenizer() to tokenizer.py's output. Ex.: assert create_tokenizer() == read_dictionary_bert().
    Here, we assume token_file and vocab_file refer to the same file, so we can compare the two outputs for the final assert.

  2. CASE 2: We run factory.py's create_tokenizer(), then pass its output as the token_file to tokenizer.py's functions, and assert against something else. Ex.: assert read_dictionary_bert(create_tokenizer(.vocab)) == ?
    Here, we assume token_file and vocab_file do not refer to the same file, and we are not sure what to compare against for the final assert.

@luke-carlson
Collaborator Author

Hey Pedro,

  1. Looks like we ended up using two different names to refer to the same thing; we should probably rename them so both use vocab_file as the variable, matching the CLI.
  2. Just those vocab options for now.
  3. The first case is a reasonable one to write; that will already increase code coverage of the tokenizer.py file.
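A self-contained sketch of CASE 1 as endorsed above. The two functions are hypothetical stand-ins, not the real create_tokenizer and read_dictionary_bert (whose signatures this thread does not fully specify); the point is the shape of the assertion: when token_file and vocab_file are the same file, the two loaders should produce equal vocabularies.

```python
# Stand-ins, NOT the real ml-mdm functions. Both parse the same
# tab-separated vocab file into a token -> index mapping, so their
# outputs can be compared directly (CASE 1 from the discussion).
from pathlib import Path


def create_tokenizer_stub(vocab_file: str) -> dict[str, int]:
    # Stand-in for factory.py's create_tokenizer().
    lines = Path(vocab_file).read_text().splitlines()
    return {line.split("\t")[0]: i for i, line in enumerate(lines)}


def read_dictionary_bert_stub(token_file: str) -> dict[str, int]:
    # Stand-in for tokenizer.py's read_dictionary_bert(); should agree
    # with the factory when token_file and vocab_file are the same file.
    lines = Path(token_file).read_text().splitlines()
    return {line.split("\t")[0]: i for i, line in enumerate(lines)}


def test_case_1(tmp_path):
    vocab_file = tmp_path / "bert.vocab"
    vocab_file.write_text("[PAD]\t0\n[CLS]\t0\nhello\t-1.5\n")

    # The CASE 1 assertion: both loaders, fed the same file, agree.
    assert create_tokenizer_stub(str(vocab_file)) == read_dictionary_bert_stub(
        str(vocab_file)
    )
```

Swapping the stubs for the real imports from ml_mdm.language_models would turn this into the coverage-increasing test described in point 3.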

@pedroborgescruz
Contributor

PR: #41