Blacklist Whitelist Improvements
To increase precision while still filtering all names, we decided to vary the number of common words that the blacklist (and whitelist) contain. The results are linked in the Google Sheet below:
Blacklist/Whitelist Process Results
I. Remove all common words from BL (a sketch of steps I and III follows this outline)
   A) Most common 1k, 5k, 10k, 20k words in English
   B) Common words defined by unigram frequency in UCSF notes
II. Run full pipeline with new BL
III. Rank FN names by number of occurrences
IV. Examine context for top-N names using grep, etc.
V. Write regex for top-N names
To further improve upon I-V:
VI. (Do something with WL?)
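
A minimal Python sketch of steps I and III, assuming the blacklist, the English frequency list, and the pipeline's FN output all live in plain one-word-per-line text files (every file name here is a placeholder, not the actual repo layout):

```python
from collections import Counter

def load_words(path):
    """One lowercase word per line."""
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]

# Assumed inputs: the names blacklist and an English unigram-frequency
# list sorted by descending frequency.
blacklist = set(load_words("blacklist_names.txt"))
common = load_words("english_20k_by_frequency.txt")

# Step I: remove the most common 1k/5k/10k/20k English words from the BL.
for k in (1000, 5000, 10000, 20000):
    reduced = blacklist - set(common[:k])
    with open(f"blacklist_minus_{k}.txt", "w") as out:
        out.write("\n".join(sorted(reduced)))

# Step III: rank false-negative names (assumed here to be one name per
# line in fn_names.txt, produced by the pipeline run) by occurrence count.
fn_counts = Counter(load_words("fn_names.txt"))
for name, n in fn_counts.most_common(25):  # top-N for Step IV context review
    print(f"{n:6d}  {name}")
```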
Deliverables:
- Remove BL entirely and report precision baseline
- Remove 'common words' until precision approaches the baseline or exceeds 80%, whichever comes first. Report precision at 1k, 5k, 10k, 20k (a sweep sketch follows this list)
- Steps III-V above. (Use exclusionary regex to compensate for the reduction of the BL.) Stop when recall > 99%
- As a group, consider WL options if the above is unable to reach our goals. Can also consider capitalization options. Also consider repeating the process using UCSF unigram frequency instead of English-word unigram frequency for blacklist subtractions and/or whitelist additions. 4a) Group meeting on meta patterns, including the order in which regexes are implemented in the algorithm.
- Do magical things with locations.
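
For the precision sweep in the second deliverable, the stopping rule could look like the sketch below; `sweep` is a hypothetical helper, and the precision numbers passed in are invented purely to show the control flow, not real results:

```python
def sweep(precision_by_k, baseline, target=0.80, tol=0.01):
    """Report precision at each cutoff; stop once it approaches the
    no-BL baseline or exceeds the 80% target, whichever comes first."""
    for k in (1000, 5000, 10000, 20000):
        p = precision_by_k[k]
        print(f"minus top {k:>6}: precision = {p:.3f}")
        if p > target or abs(p - baseline) < tol:
            return k
    return None

# Illustrative only -- NOT measured values:
stop_at = sweep({1000: 0.62, 5000: 0.71, 10000: 0.79, 20000: 0.83},
                baseline=0.85)
```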
- Randomly sample 500 notes with replacement from the combined batches 102/110 (1000 notes total); see the sampling sketch after this list
- Run 4 separate experiments with the 500-note sample:
- a. Run NNP through original blacklist
- b. Run NNP through new blacklist
- c. Run NNP and NNS through original blacklist
- d. Run NNP and NNS through new blacklist
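
A sketch of the sampling step, assuming the notes for each batch sit in their own directory (directory names are placeholders); the main point is that sampling with replacement maps to `random.choices` rather than `random.sample`:

```python
import os
import random

random.seed(42)  # fix the seed so the same 500-note sample can be reused

# Assumed directory layout; adjust to wherever the batches actually live.
batch_dirs = ["batch_102", "batch_110"]
all_notes = [os.path.join(d, f) for d in batch_dirs for f in os.listdir(d)]
assert len(all_notes) == 1000, "batches 102/110 combined should hold 1000 notes"

# Sampling WITH replacement calls for random.choices, not random.sample:
sample_500 = random.choices(all_notes, k=500)
```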
Blacklist key (construction sketched below):
- original blacklist: SS first names + census last names
- new blacklist: original blacklist + (UCSF Names - Entire Whitelist)
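
Expressed as set operations, the key above could be built like this (every file path is an assumption):

```python
def read_set(path):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

ss_first = read_set("ss_first_names.txt")        # SS first names
census_last = read_set("census_last_names.txt")  # census last names
ucsf_names = read_set("ucsf_names.txt")
whitelist = read_set("whitelist.txt")            # entire whitelist

original_blacklist = ss_first | census_last
# new blacklist = original blacklist + (UCSF Names - Entire Whitelist)
new_blacklist = original_blacklist | (ucsf_names - whitelist)
```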
- Compile a list of all false positives generated by running NNP + original blacklist, ranked by decreasing FP count.
- Create a set of new blacklists by subtracting increasing percentages of the FP list from the original blacklist (1%, 2%, 3%, 4%, 5%, and then in increments of 5% until 100% is reached).
- Create a set of corresponding config files (using the NNP + original blacklist config as a base) that differ from each other only by the blacklist used (a generation sketch follows this list).
- Run all new configs on:
- The same set of 500 randomly sampled notes used in Approach 2.
- The remaining 500 notes in combined batches 102/110.
- All 1000 notes from batches 102/110 combined.
- Pick the new blacklist that optimizes for both recall and precision.
- Write safe regex for as many remaining FPs as possible (illustrated below).
- Write unsafe regex for names removed from the blacklist (also illustrated below).
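
A sketch of the blacklist/config generation described above, assuming the FP list is saved one term per line in decreasing-count order and the base config is JSON (both assumptions; adapt to the real config format):

```python
import json

# Assumed inputs from the NNP + original blacklist run.
with open("original_blacklist.txt") as f:
    original_blacklist = {line.strip().lower() for line in f if line.strip()}
with open("fp_ranked.txt") as f:
    fp_ranked = [line.strip().lower() for line in f if line.strip()]
with open("config_nnp_original_bl.json") as f:
    base_config = json.load(f)

# 1-5% in steps of 1, then 10-100% in steps of 5, per the plan above.
percentages = [1, 2, 3, 4, 5] + list(range(10, 101, 5))

for pct in percentages:
    cutoff = len(fp_ranked) * pct // 100
    reduced = original_blacklist - set(fp_ranked[:cutoff])
    bl_path = f"blacklist_minus_{pct}pct_fp.txt"
    with open(bl_path, "w") as out:
        out.write("\n".join(sorted(reduced)))
    # Configs differ from the base only by the blacklist they point at.
    cfg = dict(base_config, blacklist=bl_path)
    with open(f"config_minus_{pct}pct.json", "w") as out:
        json.dump(cfg, out, indent=2)
```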
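The safe/unsafe distinction might look like the following, purely illustrative patterns; the real regexes depend on the contexts found during review:

```python
import re

# Safe regex (exclusionary): only lets a blacklist hit through when the
# surrounding context shows it is NOT a name, e.g. "brown" as a color.
SAFE_BROWN = re.compile(r"\bbrown\b(?=\s+(?:hair|eyes|stool|urine))", re.I)

# Unsafe regex: aggressively re-captures a name dropped from the reduced
# blacklist by leaning on a title or other name-like context.
UNSAFE_BROWN = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.?\s+Brown\b")

print(bool(SAFE_BROWN.search("patient has brown eyes")))    # True -> keep word
print(bool(UNSAFE_BROWN.search("seen by Dr. Brown today")))  # True -> redact
```

The intent of the split: a safe regex rescues a common word from over-redaction, while an unsafe regex compensates for names no longer caught by the smaller blacklist.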
- Determine where "troublesome" names are in the 20k common words list
- Subtract common words (20k?) from the names blacklist
- Subtract the remainder from the whitelist (one reading sketched below)
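
One possible reading of the two subtraction steps above, with placeholder file names: drop the common words from the names blacklist, then remove whatever remains of the blacklist from the whitelist so the two lists no longer overlap.

```python
def read_set(path):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

names_bl = read_set("blacklist_names.txt")
whitelist = read_set("whitelist.txt")
common_20k = read_set("english_20k.txt")

troublesome = names_bl & common_20k  # names that double as common words
reduced_bl = names_bl - common_20k   # subtract common words from the BL
reduced_wl = whitelist - reduced_bl  # subtract the remainder from the WL

print(f"{len(troublesome)} blacklist names also appear in the 20k list")
```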
The results of this analysis can be found here.