This package contains functions to scrape, parse, and analyze the Congressional Record. It provides tools to extract metadata from congress.gov and download the text of the Congressional Record.
- Scrape metadata from congress.gov
- Download text of the Congressional Record
- Analyze the Congressional Record (functionality to be implemented)
get_cr_df scrapes metadata for all subsections of the Congressional Record for a specific date and section.
- Parameters:
  - date: Date object
  - section: String (e.g., "senate-section", "house-section")
- Returns: DataFrame with metadata including headers and URLs to raw text
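To make the shape of that metadata concrete, here is a minimal sketch of turning subsection links into header/URL rows. It is not this package's implementation: it uses the standard library's html.parser rather than BeautifulSoup so it runs without extra dependencies, and the sample HTML, the LinkCollector class, and the "header"/"url" keys are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.congress.gov"  # site named in this README

# Hypothetical HTML fragment; the real page layout may differ.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/congressional-record/example/article-1">PRAYER</a></td></tr>
  <tr><td><a href="/congressional-record/example/article-2">PLEDGE OF ALLEGIANCE</a></td></tr>
</table>
"""


class LinkCollector(HTMLParser):
    """Collect one {"header", "url"} dict per <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append({
                "header": "".join(self._text).strip(),
                "url": urljoin(BASE_URL, self._href),  # make relative links absolute
            })
            self._href = None


def extract_rows(html):
    """Return the list of header/URL dicts found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
```

A list of dicts like rows can be passed to pandas.DataFrame(rows) to get one row per subsection.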
get_cr_htm downloads the raw text of the subsection described by a metadata row and saves it as an .htm file.
- Parameters:
  - row: DataFrame row containing metadata
  - output_dir: Directory to save the downloaded files
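The saving step can be sketched as follows. This is an illustration under stated assumptions, not the package's code: safe_filename and save_htm are hypothetical helpers, the file-naming scheme is invented for the example, and the network fetch is left out so the sketch runs offline.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(header, max_len=80):
    """Turn a subsection header into a filesystem-friendly .htm name."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", header).strip("_").lower()
    return (slug[:max_len] or "untitled") + ".htm"


def save_htm(text, header, output_dir):
    """Write text to output_dir/<slug>.htm, creating the directory if needed."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / safe_filename(header)
    path.write_text(text, encoding="utf-8")
    return path


# Demo in a temporary directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_htm(
        "<html><body>example</body></html>",
        "MORNING BUSINESS",
        Path(tmp) / "htm",
    )
    saved_name = saved.name
    saved_exists = saved.exists()
```

Deriving the file name from the header keeps downloads human-readable, but headers can repeat within a day, so a real implementation would likely need a unique suffix as well.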
from datetime import datetime
from congressional_record_scraper import get_cr_df, get_cr_htm
# Scrape metadata
date = datetime(2007, 3, 1)
cr_metadata = get_cr_df(date, section="senate-section")
# Download raw text
for _, row in cr_metadata.iterrows():
    get_cr_htm(row, "data/htm")
- Python 3.6+
- requests
- BeautifulSoup4
- pandas
- tqdm
By default, the script creates a data directory to store:
- cr_metadata.csv: CSV file containing scraped metadata
- htm subdirectory: Contains downloaded raw text files
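For readers who want to reproduce this layout by hand, a minimal sketch follows. The prepare_output and write_metadata helpers and the "header"/"url" field names are hypothetical, and the standard library's csv module stands in for pandas so the example has no dependencies.

```python
import csv
import tempfile
from pathlib import Path


def prepare_output(base_dir="data"):
    """Create base_dir and its htm subdirectory; return (csv_path, htm_dir)."""
    base = Path(base_dir)
    htm_dir = base / "htm"
    htm_dir.mkdir(parents=True, exist_ok=True)  # also creates base_dir
    return base / "cr_metadata.csv", htm_dir


def write_metadata(rows, csv_path):
    """Write a non-empty list of dicts to csv_path with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory so the real data directory is untouched.
with tempfile.TemporaryDirectory() as tmp:
    csv_path, htm_dir = prepare_output(Path(tmp) / "data")
    write_metadata([{"header": "PRAYER", "url": "https://www.congress.gov/"}], csv_path)
    layout = sorted(p.name for p in (Path(tmp) / "data").iterdir())
    first_line = csv_path.read_text(encoding="utf-8").splitlines()[0]
```

With a pandas DataFrame such as cr_metadata, the equivalent of write_metadata is cr_metadata.to_csv(csv_path, index=False).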
This scraper is designed to respect the terms of service of congress.gov. Please use responsibly and avoid overloading their servers with too many requests in a short time.
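One way to space requests out is a small throttle that enforces a minimum gap between calls. The sketch below is an assumption about how a caller might do this, not something the package provides: the Throttle class is hypothetical, and the 2-second interval is an arbitrary example value, not a limit published by congress.gov.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the demo can use a fake clock
        self._sleep = sleep
        self._last = None      # time of the previous wait(), if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demo with a fake clock so the example runs instantly.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(2.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()       # first call never sleeps
fake_now[0] += 0.5    # pretend a request took 0.5 s
throttle.wait()       # sleeps for the remaining 1.5 s
```

In a download loop, creating Throttle(2.0) once and calling throttle.wait() before each get_cr_htm(row, "data/htm") would keep requests at least two seconds apart.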