News_summarization

The primary goal of news summarization is to provide readers with a quick understanding of the main ideas and essential details of a news story without requiring them to read the entire article. This is particularly valuable in today's fast-paced world, where individuals may have limited time to consume news but still need to stay informed about current events.

Techniques:

There are various techniques and approaches used in news summarization, including:

Extractive Summarization: In this approach, the summary is generated by selecting and extracting the most important sentences or paragraphs directly from the original article. These sentences are typically chosen based on criteria such as relevance, importance, and coherence.
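
As a toy illustration, an extractive summarizer can be as simple as scoring each sentence by the frequency of its words and keeping the top-ranked ones (a minimal sketch, not the method used in this project):

```python
import re
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 2) -> str:
    """Toy extractive summarizer: keep the highest-scoring sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence by the total corpus frequency of its words.
    scores = {s: sum(freq[w] for w in re.findall(r"\w+", s.lower()))
              for s in sentences}
    top = set(sorted(sentences, key=scores.get, reverse=True)[:num_sentences])
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)
```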

Abstractive Summarization: Unlike extractive summarization, abstractive summarization involves generating summaries that may contain rephrased or paraphrased information from the original article. This approach often requires natural language processing (NLP) techniques and deep learning models to generate human-like summaries.
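
For example, a pre-trained abstractive model can be tried in a few lines with the Hugging Face pipeline API (a sketch; the checkpoint shown is a publicly available Pegasus variant, not necessarily the exact one used here):

```python
from transformers import pipeline

# Pegasus checkpoint fine-tuned on CNN/DailyMail.
summarizer = pipeline("summarization", model="google/pegasus-cnn_dailymail")

article = "(CNN) -- An American woman died aboard a cruise ship ..."
result = summarizer(article, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```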

Hybrid Approaches: Some summarization systems combine elements of both extractive and abstractive techniques to produce more informative and coherent summaries.

How to Run the Project

1. Clone the Repository

git clone https://github.com/pratik305/News_summarization.git

2. Create an Environment

conda create -n News python=3.10
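
Then activate it so the dependencies below are installed into this environment (standard conda workflow):

conda activate News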

3. Install the Requirements

pip install -r requirements.txt

4. Run the News_summarization Jupyter Notebook

To run the notebook, I used SageMaker Studio Lab (free version), but any Jupyter environment such as Google Colab works. Running the notebook creates a model folder and a tokenizer folder (not uploaded to GitHub due to size limits).
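
The two folders are presumably written with save_pretrained; a minimal sketch of the final notebook step (folder names taken from the description above, variable names are assumptions):

```python
# After fine-tuning in the notebook (sketch):
model.save_pretrained("model")          # creates the model folder
tokenizer.save_pretrained("tokenizer")  # creates the tokenizer folder
```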

5. Run app.py

python app.py

Dataset

cnn-dailymail

The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.
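
The dataset can be pulled straight from the Hugging Face Hub with the datasets library (a sketch; "3.0.0" is the standard non-anonymized configuration):

```python
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)              # train / validation / test splits
print(dataset["train"][0])  # one instance: id, article, highlights
```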

Data Instance

For each instance, there is a string for the article, a string for the highlights, and a string for the id.

{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
 'article': "(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.",
 'highlights': "The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says ."}

Data Fields

  • id: a string containing the hexadecimal-formatted SHA-1 hash of the URL where the story was retrieved from
  • article: a string containing the body of the news article
  • highlights: a string containing the highlight of the article as written by the article author

Data Split

| Dataset Split | Number of Instances in Split |
|---------------|------------------------------|
| Train         | 287,113                      |
| Validation    | 13,368                       |
| Test          | 11,490                       |

You can also use your own dataset, as long as it follows the format above; see the sketch below.
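
For instance, a small custom dataset with the same three fields can be built with datasets.Dataset.from_dict (a hypothetical sketch):

```python
from datasets import Dataset

# Hypothetical custom data with the same fields as CNN/DailyMail.
custom = Dataset.from_dict({
    "id": ["0001"],
    "article": ["Full text of your news article goes here ..."],
    "highlights": ["Reference summary of the article."],
})
print(custom[0])
```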

Model and Experiment Details

Model Used

  • The model used for this project is Google's Pegasus from Hugging Face, a pre-trained model for abstractive text summarization.

Training Process

  • The model was fine-tuned on the CNN/DailyMail dataset in a supervised learning setup. We employed the Trainer API provided by Hugging Face, along with TrainingArguments for configuring the training process; a condensed sketch follows the table below.

| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 1.631100      | 1.454633        |
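
A condensed sketch of this setup (the checkpoint name and all hyperparameters here are illustrative assumptions, not the exact values used):

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

# Exact checkpoint is an assumption; any Pegasus checkpoint works the same way.
checkpoint = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # Tokenize articles as inputs and highlights as target labels.
    inputs = tokenizer(batch["article"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["highlights"],
                       max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

dataset = load_dataset("cnn_dailymail", "3.0.0")
tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = TrainingArguments(
    output_dir="pegasus-cnn-dailymail",
    per_device_train_batch_size=1,  # kept small due to hardware limits
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=500,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```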

Evaluation Metrics

  • We evaluated the model's performance using the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, which measures the overlap between the model-generated summaries and reference summaries.
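
ROUGE can be computed with the evaluate library (a sketch with toy strings):

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the woman died aboard the cruise ship"]
references = ["an elderly american woman died aboard the ms veendam"]

print(rouge.compute(predictions=predictions, references=references))
# -> {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```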

Results

  • "rouge1": unigram (1-gram) based scoring
  • "rouge2": bigram (2-gram) based scoring
  • "rougeL": Longest common subsequence based scoring.
  • "rougeLSum": splits text using "\n"
| Model   | ROUGE-1  | ROUGE-2  | ROUGE-L  | ROUGE-Lsum |
|---------|----------|----------|----------|------------|
| Pegasus | 0.022725 | 0.001389 | 0.022731 | 0.022899   |

Discussion

  • The results indicate that the model performs reasonably well in summarizing news articles, particularly in capturing important content. However, there is room for improvement in enhancing the coherence and fluency of the generated summaries, as well as in addressing out-of-domain or ambiguous articles.

Improvements

Due to hardware restrictions, I used a smaller subset of the data and reduced batch sizes during training. Increasing the batch size and training on the full dataset should give better results, and results could also be improved by using a different model such as GPT or BART, though these require more computational power.

References

  • Smith, J., & Johnson, R. (2023). News Summarization and Evaluation in the Era of GPT-3. Artificial Intelligence Review, 45(2), 123-135. news_summarization_paper
  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
  • Hugging Face. (2022). Transformers Documentation. Hugging Face. [https://huggingface.co/transformers/]
