12: Improved Entity Matching #42

Open
wants to merge 7 commits into
base: main
166 changes: 166 additions & 0 deletions 012-improve-entity-matching.md
# OSEP #12: Improved Entity Matching

| | |
|--------------------|----------------------------------------------------------------|
| **Author(s)** | @newageairbender |
| **Implementer(s)** | @newageairbender, @jessemortenson, @alexobaseki |
| **Status** | Draft |
| **Issue** | https://github.com/openstates/enhancement-proposals/issues/TBD |
| **Draft PR(s)** | https://github.com/openstates/enhancement-proposals/pull/TBD |
| **Approval PR(s)** | https://github.com/openstates/enhancement-proposals/pull/TBD |
| **Created** | 2024-07-01 |
| **Updated** | 2024-07-31 |

---

## Abstract

With the new 2024 session, we had far more eyes on Events & Votes in addition to our usual Bill activity. Working through
bug tickets, it became evident that there was only so much we could do in some scrapers, while other missing data could be
traced back to a lack of proper entity matching. This EP proposes to start improving the matching by passing in data that
narrows the query results returned on import.


## Specification

### People Matching on Sponsorship, Votes, & Events
To help resolve People mismatching, there is already an option to pass an `org_classification` to the
[resolve_person](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/base.py#L526)
function on the `BaseImporter`, which is used to query & match People to Bills, Events, & Votes. If
`org_classification` isn't set, the query defaults to matching any of `upper`, `lower`, & `legislature`. If we ensure
that an `org_classification` is passed in wherever `resolve_person` is used in the Bill, Event, & Vote importers, we
should be able to alleviate some of that mismatching. Some scraper updates may be needed to ensure that the
classification is correct, for example when a Bill gets sponsors from the chamber opposite the one it was introduced in.
For Votes, where the voting body is either a Chamber or a Committee, we can narrow down People by classification based
on that voting body with more accuracy. Because of this, we should start by adding `org_classification` to Events &
Votes before tackling Bills.

When we get to Bills, `chamber` is already a passable value on [add_sponsorship](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/scrape/bill.py#L105),
so it will mostly be scraper work to ensure that the correct chamber is passed in per sponsorship. For example,
scrapers should be updated to designate the chamber when "Representative" or "Senator" is listed with the Sponsor's
name. Where the source groups sponsor names by House vs Senate, as in [IL](https://ilga.gov/legislation/BillStatus.asp?DocNum=4910&GAID=17&DocTypeID=HB&LegId=152782&SessionID=112&GA=103),
we can be certain of the chamber to pass in for `org_classification`.
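
A scraper-side helper for the title check might look roughly like the following. This is a minimal sketch, not code from any existing scraper: the `infer_sponsor_chamber` name and the list of title prefixes are hypothetical, and real sources will need per-jurisdiction patterns.

```python
import re

# Hypothetical title prefixes; each jurisdiction's source will need its own list.
_TITLE_TO_CHAMBER = [
    (re.compile(r"^(rep\.|representative)\s+", re.IGNORECASE), "lower"),
    (re.compile(r"^(sen\.|senator)\s+", re.IGNORECASE), "upper"),
]


def infer_sponsor_chamber(raw_name, default=None):
    """Return (clean_name, chamber) for a scraped sponsor string.

    chamber is "upper", "lower", or `default` when no title is present.
    """
    name = raw_name.strip()
    for pattern, chamber in _TITLE_TO_CHAMBER:
        match = pattern.match(name)
        if match:
            # Drop the title so only the name is used for matching.
            return name[match.end():].strip(), chamber
    return name, default
```

The returned chamber is meant to be handed to the existing `chamber` argument of `add_sponsorship`. When no title is present, the scraper can fall back to a default such as the chamber the Bill was introduced in.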

We should also consider adding People's nicknames to `other_names` in the yaml files through the People script, so we
can catch matches when the scraped name isn't exactly the stored name, such as when a person goes by multiple first
names or includes their middle name/initial in some places to differentiate themselves from people with similar names.

#### Solutions:
- Core: Add `org_classification` to Events & Votes wherever `resolve_person` is used on import, based on data
provided by the scrape
- Core: Add `org_classification` to Bill import for Sponsors, though this may need to come after scraper improvements
in jurisdictions where a Bill can have sponsors from both chambers
- Scrapers: Ensure the correct `chamber` is passed in with `add_sponsorship` on Bill scrapes
- People Script: Update the People Script to include name values that may be overwritten as `other_names` options
- People Repo: Add `other_names` values that match scraped name formats for sponsorship or votes

> **Review comment:** How do we intend to do this? Maybe using the people matching tool? Explaining how we will arrive at this would be useful.


### Committees as Bill Sponsors
For resolving Committees as Bill Sponsors, there is already logic that should be able to match them in the `BillImporter`'s
[prepare_for_db](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/bills.py#L147)
function, so we need to ensure that scrapers check whether the Sponsor is a Person or an Organization and pass that in
correctly as the `entity_type` in `add_sponsorship()`. The only fix needed is in the scrapers themselves.
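
One way a scraper could make that Person vs Organization check is a keyword test on the scraped sponsor name. This is a hedged sketch only: the `sponsor_entity_type` helper and its keyword list are hypothetical, and some jurisdictions will need different hints.

```python
import re

# Hypothetical keywords suggesting the sponsor is a committee rather than a person.
_COMMITTEE_HINTS = re.compile(
    r"\b(committee|subcommittee|commission|task force)\b", re.IGNORECASE
)


def sponsor_entity_type(raw_name):
    """Guess the `entity_type` to pass to add_sponsorship for a scraped sponsor."""
    if _COMMITTEE_HINTS.search(raw_name):
        return "organization"
    return "person"
```

A scraper would then pass the result as `entity_type` when calling `add_sponsorship`.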

#### Solution:
- Scrapers: Ensure correct `entity_type` is passed in with `add_sponsorship` on Bill Scrapes (just need to check which
states have unmatched People that are actually Committees)

### Committees on Events
Similarly, to help resolve Committees, we can improve the matching query by cleaning or splitting up the scraped name
into its different Committee elements, such as Chamber & Type, and then incorporating those into the `OrganizationImporter`
[limit_spec](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/organizations.py#L11)
logic. This will be a bit messier, so I propose that we instead add `other_names` to Committee files to more easily
match what is commonly scraped, like we did [for MN](https://github.com/openstates/people/pull/1442/files) when Events
were "missing" because of name mismatching, and update the `limit_spec` logic to check more than the first `other_names`
string. This is the preferred route: we can update the Committee script to include the other formats
of the name without Engineering & Product having to write to hundreds of files, and we can easily incorporate multiple
name formats to accommodate however the source posts the Committees (ex: 'Committee on Ending Homelessness'
as a Bill Sponsor vs 'House Ending Homelessness' on Events, etc.).

Currently, the `limit_spec` function is used to override the Django default and limit the query parameters. Right
now, the function works as follows:
- If classification is NOT `party`, add the `jurisdiction_id` to the query spec

> **Review comment:** This step is not terribly clear to me. What is the "Django default", which classification is NOT party, and which entity's `jurisdiction_id` are we adding to the query spec? A little explanation, or a link to where these changes will happen, would be helpful.

- If name is set, match on (the rest of the spec) AND ((first `other_names` value matches name) OR (name is an exact match))
- If name is NOT set, just match on the rest of the spec

If we go the `other_names` route, the change we'd need to make is:

> **Review comment:** Let's keep "other names" consistent across the board. I see it is `other_names` in the code.

- If name is set, match on (the rest of the spec) AND ((~~first~~ ANY `other_names` value matches name) OR (name is an exact match))
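
The difference between checking only the first `other_names` value and checking any of them can be illustrated with plain Python. This is a stand-in for the Django query, not the importer's actual code: the `name_matches` helper and the sample committee record are hypothetical.

```python
def name_matches(org, name, any_other_name=False):
    """Check a scraped name against an organization record.

    `org` is a dict with "name" and an optional "other_names" list of
    {"name": ...} entries. With any_other_name=False only the first
    other name is considered; with True, all of them are.
    """
    if org["name"] == name:
        return True
    others = [entry["name"] for entry in org.get("other_names", [])]
    candidates = others if any_other_name else others[:1]
    return name in candidates


# Sample record using the name formats mentioned in this proposal.
committee = {
    "name": "Ending Homelessness",
    "other_names": [
        {"name": "Committee on Ending Homelessness"},
        {"name": "House Ending Homelessness"},
    ],
}
```

With this record, 'House Ending Homelessness' only matches when `any_other_name=True`, because it is the second `other_names` entry.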

If we wanted to split up by chamber & type first in `core`, we'd have to:
- Update [add_participant](https://github.com/openstates/openstates-core/blob/7ac7b73bbb0956f7a539128f9186929509c19550/openstates/scrape/event.py#L140)
and `add_committee` to accept a `chamber` value or a `committee_type` of `committee` or `subcommittee` (if `subcommittee`,
add `parent_committee_id`)

> **Review comment:** Is there a link to `add_committee`, or is this going to be a new function?

- Add that `chamber` value to the `self.org_importer.resolve_json_id` calls in the `EventImporter` on lines [92](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/events.py#L92)
and 101
- In `limit_spec`, if classification is `committee`, add the `chamber_id` to the query spec

> **Review comment:** Forgive my ignorance, is this `limit_scope` or `limit_spec`?

- In `limit_spec`, if classification is `committee`, add the `committee_type` to the query spec
- In `limit_spec`, if classification is `committee` AND `committee_type` = `subcommittee`, add the
`parent_committee_id` to the query spec
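
The three spec rules above could be sketched as a small function that builds the query spec as a dict. The `committee_spec` helper is hypothetical, and the `chamber_id`, `committee_type`, and `parent_committee_id` keys are the names used in this proposal rather than verified model fields.

```python
def committee_spec(name, classification="committee", chamber_id=None,
                   committee_type=None, parent_committee_id=None):
    """Build a query spec, adding the narrowing keys only for committees."""
    spec = {"name": name, "classification": classification}
    if classification != "committee":
        return spec
    if chamber_id is not None:
        spec["chamber_id"] = chamber_id
    if committee_type is not None:
        spec["committee_type"] = committee_type
    # The parent is only meaningful when matching a subcommittee.
    if committee_type == "subcommittee" and parent_committee_id is not None:
        spec["parent_committee_id"] = parent_committee_id
    return spec
```

Each key is added only when the scraper supplied a value, so scrapers that pass nothing new keep today's behavior.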

#### Solutions:
- Core: Fix `limit_spec` on the `OrganizationImporter` so that more than just the first string in `other_names` is checked for
Committees
- People Script: Update the Committee Script to include `other_names` for Committees that include Chamber, Type, & Both

> **Review comment:** I am assuming chamber is like House, Senate, Joint. What are Type and Both?


### Bill Matching to Event Agenda Items
When it comes to matching Bills to Agenda Items on Events, the path forward is less clear. Right now we have a [resolve_bill](https://github.com/openstates/openstates-core/blob/ac8e53aefe2a70d8ff360fc8b641bf77f28e2d7c/openstates/importers/base.py#L164)
function on the `BaseImporter` that attempts to match Bills via `bill_id`, `jurisdiction_id`, & `date` (if one is passed).
This seems like it could be improved by incorporating some of the logic in `resolve_related_bills` that Jesse worked on
this spring, where the match query is also narrowed down by `session_id`. We can certainly pass in more data to
identify the Bill match better, but we could also incorporate an LLM, so we will be testing out different approaches.

#### Solutions:
- Scrapers: Ensure `bill_identifier` matches the format of the expected Bill per jurisdiction
- Core: Bill Identifier match improvements, passing in more data (at least `session`, maybe `chamber`)
- Core: Add an LLM to try for better matching alongside the Core improvement above
- Core: Potentially add a CLI command to try matching Events that have unmatched Bills in their agendas to Bills post-import
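
As an illustration of the first two items, the sketch below normalizes a scraped identifier and then narrows the match by session. It is not the `resolve_bill` implementation: the `normalize_bill_identifier` and `match_agenda_bill` helpers are hypothetical, and the "HB 12" target format is an assumption that will differ per jurisdiction.

```python
import re


def normalize_bill_identifier(raw):
    """Normalize strings like 'hb0012', 'H.B. 12', or 'HB-12' to 'HB 12'.

    Strings that are not a letter prefix plus a number are returned stripped.
    """
    cleaned = re.sub(r"[.\-\s]", "", raw.upper())
    match = re.fullmatch(r"([A-Z]+)0*(\d+)", cleaned)
    if not match:
        return raw.strip()
    prefix, number = match.groups()
    return f"{prefix} {number}"


def match_agenda_bill(bills, identifier, session=None):
    """Return the single bill matching identifier (and session), else None."""
    wanted = normalize_bill_identifier(identifier)
    matches = [
        bill for bill in bills
        if bill["identifier"] == wanted
        and (session is None or bill["session"] == session)
    ]
    return matches[0] if len(matches) == 1 else None
```

Without a session, the same identifier in two sessions is ambiguous and the sketch returns `None`. Passing the session resolves it.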

## Rationale

### Bills or Votes to People or Committees
We've known for a while that matching Bills or Votes to Sponsors has been tricky, hence OSEP #3 to help alleviate some
of the issues with mismatching legislators. The People Matcher Tool can only get us so far: we run into a blocker
when there are legislators with the same last name in a jurisdiction, or when the sponsor is actually a committee. In
those cases, adding an `other_names` entry to a person's yaml file isn't a possible fix.

Current example for matching a Person to a Bill Sponsor:
- Bill scraper calls `add_sponsorship(name="JOHNSON", entity_type="person", classification="primary", primary=True)`
- `add_sponsorship` creates a `pseudo_person_id` for the name JOHNSON
- `BillImporter` calls `resolve_person`, passing in that `pseudo_person_id` with start/end date values from the Bill's `session`
- [resolve_person](https://github.com/openstates/openstates-core/blob/7ac7b73bbb0956f7a539128f9186929509c19550/openstates/importers/base.py#L526)
constructs a spec that is used to compose filters to query the Person model for a match. We could pass in
`org_classification` to narrow down by chamber, but currently don't
- If the jurisdiction has more than one legislator with the last name "Johnson", the Importer gives an error message that
`multiple people returned for spec` but continues through the Import task
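
The JOHNSON case can be worked through in plain Python to show how a chamber narrows an ambiguous match. This is only a sketch of the idea, not `resolve_person` itself: the `match_person` helper, the record layout, and the sample people are hypothetical.

```python
# Classifications searched when no org_classification is supplied.
DEFAULT_CLASSIFICATIONS = ("upper", "lower", "legislature")


def match_person(people, name, org_classification=None):
    """Return the single person matching `name`, or None if zero or several do."""
    allowed = (org_classification,) if org_classification else DEFAULT_CLASSIFICATIONS
    wanted = name.lower()
    matches = [
        person for person in people
        if person["chamber"] in allowed
        and wanted in (person["name"].lower(), person["family_name"].lower())
    ]
    if len(matches) == 1:
        return matches[0]
    return None  # no match, or the "multiple people returned for spec" case


people = [
    {"name": "Bill Johnson", "family_name": "Johnson", "chamber": "lower"},
    {"name": "Mike Johnson", "family_name": "Johnson", "chamber": "upper"},
]
```

Here `match_person(people, "JOHNSON")` finds two candidates and returns `None`, while `match_person(people, "JOHNSON", org_classification="upper")` returns the single upper-chamber record.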

> **Review comment:** I am imagining that "multiple people returned for spec" will be limited if you have the organization classification in the `resolve_person` query.
>
> Thinking of an idea here: when these mismatches do occur, can we save them somewhere in the database or in a file that can be used in the matching tool? I envision a matching tool that is broader than just the unmatched name and a list of all people. Having an idea of the multiple objects returned can help quickly resolve a mismatch if and when it happens. This might also be a good way to gather data on mismatches. Something like:
>
> | unknown_entity | possible_matches | is_resolved |
> |----------------|------------------|-------------|
> | JOHNSON        | JOHNSON, Bill    | False       |
> | JOHNSON        | JOHNSON, Mike    | False       |


### Events to Committees
A similar issue has been happening with matching Events to their Participants (typically a Committee). The scraped name
of a participant can range from something vague such as "Rules" with no chamber to something more specific like
"Assembly Privacy and Consumer Protection Committee", while the name of the Committee in the yaml file doesn't list the
chamber. We've now settled on a standard expectation for the OS People repo that a Committee's name will be just the
name, without chamber & committee type, since those can be derived from data in the yaml file. That should make
matching easier if we can narrow the match query based on those attributes.

### Events to Bills
Another area where we're struggling to match entities is Events to the Bills listed in their Agenda Items. Sometimes
it's clearly because the scraped bill id format is different from how the Bill gets saved, but sometimes it's less clear
as to why some Bills get matched but others don't. Occasionally, a Bill that doesn't exist in OS yet is mentioned
as an Event's Agenda Item, so it won't be attached to the Event until a later scrape, once the Bill
is in the system.

## Drawbacks

We should absolutely add defaults for any new values if we're not certain what's going to be passed in on `core` updates.

> **Review comment:** What does this mean?


## Implementation Plan
Most steps are listed above alongside the entity types they fix; additional plans are included below.

#### Setup
- Pull numbers for average percent matched per data type, also broken down per jurisdiction
- Create harnesses to try & limit testing scope per data type. Can include bug tickets for specific jurisdictions
- Create shared database for running tests on improvements
- Insights team tests to see if we can use AI to help match more entities
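
For the first Setup item, a baseline could be computed along these lines. This is a minimal sketch under assumptions: the `match_rates` helper and the `(jurisdiction, data_type, matched)` record shape are hypothetical, and the per-data-type figure here pools all records rather than averaging the per-jurisdiction percentages.

```python
from collections import defaultdict


def match_rates(records):
    """Compute percent matched per (jurisdiction, data_type).

    `records` is an iterable of (jurisdiction, data_type, matched) tuples.
    The key ("*", data_type) holds the pooled rate across all jurisdictions.
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [matched count, total count]
    for jurisdiction, data_type, matched in records:
        for key in ((jurisdiction, data_type), ("*", data_type)):
            totals[key][0] += int(matched)
            totals[key][1] += 1
    return {key: 100.0 * hit / total for key, (hit, total) in totals.items()}
```

Running the same function before and after each matching change gives a per-jurisdiction comparison for the test harnesses.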

## Copyright

This document has been placed in the public domain per the [Creative Commons CC0 1.0 Universal license.](https://creativecommons.org/publicdomain/zero/1.0/deed)