Project to train and evaluate a named entity recognition model for analyzing text-to-image prompts. The entities comprise 17 categories (7 main categories and 11 subcategories) extracted from a topic analysis made with BERTopic. The topic analysis can be explored in the following visualization.
Specifier taxonomy
├── medium/
│ ├── photography
│ ├── painting
│ ├── rendering
│ └── illustration
├── influence/
│ ├── artist
│ ├── genre
│ ├── artwork
│ └── repository
├── light
├── color
├── composition
├── detail
└── context/
  ├── era
  ├── weather
  └── emotion
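For working with the taxonomy in code, the tree above can be written as a nested mapping. This is only a minimal sketch: the `TAXONOMY` name, the `leaf_labels` helper and the `parent/child` label format are assumptions made for illustration, and the label strings actually used in the annotated data may differ.

```python
# Hypothetical representation of the specifier taxonomy shown above.
# An empty list marks a main category that has no subcategories.
TAXONOMY = {
    "medium": ["photography", "painting", "rendering", "illustration"],
    "influence": ["artist", "genre", "artwork", "repository"],
    "light": [],
    "color": [],
    "composition": [],
    "detail": [],
    "context": ["era", "weather", "emotion"],
}


def leaf_labels(taxonomy):
    """Return one 'parent/child' label per subcategory, or the bare parent name."""
    labels = []
    for parent, children in taxonomy.items():
        if children:
            labels.extend(f"{parent}/{child}" for child in children)
        else:
            labels.append(parent)
    return labels


if __name__ == "__main__":
    for label in leaf_labels(TAXONOMY):
        print(label)
```

Keeping the tree in one place like this makes it easy to check annotated labels against the taxonomy, whatever naming scheme the annotations use.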
Prompt data are from the diffusionDB database and were annotated by hand using Prodigy.
The `project.yml` defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
spaCy projects documentation.
The following commands are defined by the project. They
can be executed using `spacy project run [name]`.
Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `install` | Install dependencies, log in to Hugging Face and download a model |
| `preprocess` | Convert the data to spaCy's binary format |
| `pretrain` | Pretrain the embedding on all prompts |
| `train` | Train a named entity recognition model |
| `evaluate` | Evaluate the model and export metrics |
| `package` | Package the trained model so it can be installed |
| `push_to_hub` | Push the model to the Hub |
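The rule that commands are only re-run if their inputs have changed can be pictured as a checksum comparison. The sketch below is a simplified illustration of that idea, not spaCy's actual implementation: the helper names, the JSON lock-file layout and the use of SHA-256 are all assumptions.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def _load_lock(lock_path):
    """Read the (hypothetical) JSON lock file, or return an empty dict."""
    lock_path = Path(lock_path)
    return json.loads(lock_path.read_text()) if lock_path.exists() else {}


def needs_rerun(name, inputs, lock_path):
    """Return True if any input differs from the checksums stored for `name`."""
    current = {str(p): file_checksum(p) for p in inputs}
    return _load_lock(lock_path).get(name) != current


def record_run(name, inputs, lock_path):
    """Store the current input checksums for `name` after a successful run."""
    lock = _load_lock(lock_path)
    lock[name] = {str(p): file_checksum(p) for p in inputs}
    Path(lock_path).write_text(json.dumps(lock, indent=2))
```

A command that has never been recorded always counts as needing a run, because the lock file has no entry to compare against.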
The following workflows are defined by the project. They
can be executed using `spacy project run [name]`
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `preprocess` → `pretrain` → `train` → `evaluate` → `package` → `push_to_hub` |
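When driving the project from a script instead of the shell, the same commands and workflows can be invoked through `subprocess`. This is a minimal sketch that assumes spaCy is installed in the current Python environment; `build_run_command` and `run` are hypothetical helper names.

```python
import subprocess
import sys

# Command and workflow names as listed in the tables above.
COMMANDS = [
    "install", "preprocess", "pretrain", "train",
    "evaluate", "package", "push_to_hub",
]
WORKFLOWS = ["all"]


def build_run_command(name, project_dir="."):
    """Build the argument list for `spacy project run <name> <project_dir>`."""
    if name not in COMMANDS and name not in WORKFLOWS:
        raise ValueError(f"Unknown command or workflow: {name!r}")
    return [sys.executable, "-m", "spacy", "project", "run", name, project_dir]


def run(name, project_dir="."):
    """Execute a project command and raise if it exits with a non-zero status."""
    subprocess.run(build_run_command(name, project_dir), check=True)


if __name__ == "__main__":
    # Only print the command here; call run("all") to actually execute it.
    print(" ".join(build_run_command("all")))
```

Note that the `all` workflow does not include `install`, so that command has to be run separately first.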
The following assets are defined by the project. They can
be fetched by running `spacy project assets`
in the project directory.
| File | Source | Description |
| --- | --- | --- |
| `assets/diffusiondb_raw_prompts.jsonl` | URL | JSONL-formatted file of all diffusionDB prompts |
| `assets/ner_prompting.jsonl` | URL | JSONL-formatted development data exported from Prodigy, annotated with 17 entities |
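Before preprocessing, it can help to check how often each entity label occurs in the annotated asset. The sketch below assumes the usual Prodigy NER export layout, where each JSONL record carries a `spans` list with `label` keys and an `answer` field. `count_labels` is a hypothetical helper, and the sample record and its label string are made up for illustration.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count entity labels across Prodigy-style JSONL records.

    Assumes each record may have a "spans" list whose items carry a "label"
    key, and skips records whose "answer" is present but not "accept".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("answer", "accept") != "accept":
            continue
        for span in record.get("spans") or []:
            counts[span["label"]] += 1
    return counts


if __name__ == "__main__":
    # Made-up example record; the label string is hypothetical.
    sample = [
        '{"text": "oil painting of a fox", "answer": "accept", '
        '"spans": [{"start": 0, "end": 12, "label": "medium/painting"}]}'
    ]
    print(count_labels(sample))
```

To run it on the real data, pass an open file handle for `assets/ner_prompting.jsonl` to `count_labels` after fetching the assets.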