News Articles Search Engine

Project Description

This project periodically imports news articles crawled from the internet by Common Crawl. The imported news articles can be searched by keyword (matches are highlighted in the paginated results), publishing date, and language (English or non-English).
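
As a rough illustration of how such a search could be expressed, the sketch below builds an Elasticsearch query body that combines a keyword match, an optional publishing date range, an English or non-English filter, highlighting, and paging. The function name `build_search_query`, the field names (`title`, `text`, `language`, `publishing_date`), and the language code `en` are assumptions made for this example and are not taken from the project's code.

```python
from typing import Optional


def build_search_query(keywords: str,
                       date_from: Optional[str] = None,
                       date_to: Optional[str] = None,
                       english: Optional[bool] = None,
                       page: int = 0,
                       page_size: int = 10) -> dict:
    """Build an Elasticsearch query body (field names are assumed, not the project's)."""
    # Keyword match against the title and body text.
    bool_query = {
        "must": [{"multi_match": {"query": keywords, "fields": ["title", "text"]}}]
    }

    # Optional publishing date range, e.g. "2020-01-01" to "2020-12-31".
    date_range = {}
    if date_from:
        date_range["gte"] = date_from
    if date_to:
        date_range["lte"] = date_to
    if date_range:
        bool_query["filter"] = [{"range": {"publishing_date": date_range}}]

    # english=True keeps English articles, english=False keeps everything else,
    # english=None applies no language restriction.
    if english is True:
        bool_query.setdefault("filter", []).append({"term": {"language": "en"}})
    elif english is False:
        bool_query["must_not"] = [{"term": {"language": "en"}}]

    return {
        "from": page * page_size,  # offset of the first hit on this page
        "size": page_size,
        "query": {"bool": bool_query},
        # Ask Elasticsearch to return highlighted fragments for matched keywords.
        "highlight": {"fields": {"title": {}, "text": {}}},
    }
```

Calling `build_search_query("election", date_from="2020-01-01", english=False, page=2)` would produce a body that skips the first 20 hits and excludes articles tagged as English.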

This project consists of 4 parts:

An AWS Lambda function that checks every hour for the latest crawled news articles, imports them, and sends the parsed articles to AWS Simple Queue Service (SQS)
An AWS Lambda function that retrieves the parsed articles from AWS SQS and posts them to AWS Elasticsearch Service
Backend Search API
Frontend Search Website
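
The hand-off between the queue and the search index in the second part can be sketched as follows. The example turns an SQS-triggered Lambda event into a newline-delimited body in the format the Elasticsearch `_bulk` API expects. It assumes each SQS message body is one parsed article serialized as JSON with a `url` key, an index named `news`, and Elasticsearch 7 or later (no mapping types). The function name and the choice to hash the URL into a document ID are illustrative, not the project's actual implementation, and the HTTP request that sends the body is left out.

```python
import hashlib
import json


def sqs_event_to_bulk_body(event: dict, index: str = "news") -> str:
    """Convert an SQS Lambda event into an Elasticsearch _bulk request body."""
    lines = []
    for record in event.get("Records", []):
        # Each SQS record carries the message payload as a string in "body".
        article = json.loads(record["body"])

        # Assumption: derive a stable ID from the URL so that re-importing
        # the same article overwrites it instead of creating a duplicate.
        doc_id = hashlib.sha256(article["url"].encode("utf-8")).hexdigest()

        # The bulk format is one action line followed by one document line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(article))

    # The bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""
```

Batching the messages into one bulk request keeps the number of calls to the search cluster low when SQS delivers several records per invocation.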

The news HTML pages are stored together in a .warc.gz file. Before being posted to AWS Elasticsearch Service, each page is parsed into a JSON object with the following fields:

URL
Title
Text
Language
Publishing Date
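
A minimal sketch of a record with these fields is shown below, using only the Python standard library. It extracts the title and visible text from a single HTML page. Reading the .warc.gz file itself is omitted, and the language and publishing date are passed in by the caller because the source does not say how they are determined. The class and function names and the JSON key spellings are assumptions for this example.

```python
import json
from dataclasses import asdict, dataclass
from html.parser import HTMLParser


@dataclass
class NewsArticle:
    url: str
    title: str
    text: str
    language: str
    publishing_date: str


class _ArticleHTMLParser(HTMLParser):
    """Collects the <title> content and the visible text of a page."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            # Keep only non-empty text outside script/style/noscript.
            self.text_parts.append(data.strip())


def parse_article(url: str, html: str, language: str, publishing_date: str) -> NewsArticle:
    """Parse one HTML page into the five-field record described above."""
    parser = _ArticleHTMLParser()
    parser.feed(html)
    parser.close()
    return NewsArticle(
        url=url,
        title="".join(parser.title_parts).strip(),
        text=" ".join(parser.text_parts),
        language=language,
        publishing_date=publishing_date,
    )


def to_json(article: NewsArticle) -> str:
    """Serialize the record as the JSON object that would be queued or indexed."""
    return json.dumps(asdict(article))
```

A real news page would need more careful boilerplate removal (navigation, ads, comments) than this simple tag filter provides.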

The backend Search API is deployed to an Apache Tomcat environment managed by AWS Elastic Beanstalk. The frontend Search website is hosted in an AWS S3 bucket.

Due to academic integrity policy, the complete code base is available only upon request. A video demo that describes this project is available here.

Albert-Z-Guo/News-Articles-Search-Engine

Folders and files

Latest commit

History

Repository files navigation

News Articles Search Engine

Project Description

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages