The primary goal of news summarization is to provide readers with a quick understanding of the main ideas and essential details of a news story without requiring them to read the entire article. This is particularly valuable in today's fast-paced world, where individuals may have limited time to consume news but still need to stay informed about current events.
Techniques:
There are various techniques and approaches used in news summarization, including:
Extractive Summarization: In this approach, the summary is generated by selecting and extracting the most important sentences or paragraphs directly from the original article. These sentences are typically chosen based on criteria such as relevance, importance, and coherence. (A short sketch contrasting this with abstractive summarization appears after this list.)
Abstractive Summarization: Unlike extractive summarization, abstractive summarization involves generating summaries that may contain rephrased or paraphrased information from the original article. This approach often requires natural language processing (NLP) techniques and deep learning models to generate human-like summaries.
Hybrid Approaches: Some summarization systems combine elements of both extractive and abstractive techniques to produce more informative and coherent summaries.
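To make the distinction concrete, here is a minimal illustrative sketch (not code from this repository): a toy frequency-based extractive scorer, followed by an abstractive call through the Hugging Face `pipeline` API. The function name `extractive_summary` and the scoring heuristic are assumptions for illustration only.

```python
import re
from collections import Counter

from transformers import pipeline  # for the abstractive example below

def extractive_summary(text, num_sentences=2):
    """Toy extractive summarizer: keep the sentences whose words are most frequent."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence by the total corpus frequency of its words.
    scores = [sum(freq[w] for w in re.findall(r"\w+", s.lower())) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))  # preserve original order

# Abstractive summarization, by contrast, generates new text:
summarizer = pipeline("summarization", model="google/pegasus-cnn_dailymail")
# print(summarizer(some_article_text, max_length=64)[0]["summary_text"])
```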
git clone https://github.com/pratik305/News_summarization.git
conda create -n News python=3.10
conda activate News
pip install -r requirements.txt
To run the Jupyter notebook, I used SageMaker Studio Lab (free tier); any Jupyter environment, such as Google Colab, works as well. Running the notebook creates the model and tokenizer folders, which are not uploaded to GitHub due to size limits.
python app.py
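Once the notebook has produced those folders, a minimal inference sketch looks like the following. The folder names `model` and `tokenizer` are assumptions; point them at whatever the notebook actually saves.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed paths: the folders created by the training notebook.
tokenizer = AutoTokenizer.from_pretrained("tokenizer")
model = AutoModelForSeq2SeqLM.from_pretrained("model")

article = "(CNN) -- An American woman died aboard a cruise ship ..."
inputs = tokenizer(article, truncation=True, max_length=1024, return_tensors="pt")
summary_ids = model.generate(**inputs, num_beams=8, max_length=128, length_penalty=0.8)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```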
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.
Data Instances
For each instance, there is a string for the article, a string for the highlights, and a string for the id.
{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
'article': '(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.',
'highlights': 'The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says .'}
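The dataset can be loaded directly with the `datasets` library; a quick sketch:

```python
from datasets import load_dataset

# "3.0.0" is the non-anonymized version used for abstractive summarization.
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)                             # train / validation / test splits
print(dataset["train"][0]["highlights"])   # one reference summary
```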
Data Fields
- id: a string containing the hexadecimal-formatted SHA-1 hash of the URL from which the story was retrieved
- article: a string containing the body of the news article
- highlights: a string containing the highlight of the article as written by the article author
Data Split
Dataset Split | Number of Instances in Split |
---|---|
Train | 287,113 |
Validation | 13,368 |
Test | 11,490 |
You can use your own dataset, but it must follow the format shown above.
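For example, a custom dataset with the same three fields can be built with `datasets.Dataset.from_dict`; the rows below are placeholders:

```python
from datasets import Dataset

# Placeholder rows: replace with your own articles and reference summaries.
rows = {
    "id": ["example-0001"],
    "article": ["Full text of your news article goes here ..."],
    "highlights": ["Your reference summary goes here ."],
}
custom_dataset = Dataset.from_dict(rows)
print(custom_dataset)
```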
- The model utilized for this project is Google Pegasus from Hugging Face, a pre-trained model for abstractive text summarization.
- The model was fine-tuned on the CNN/DailyMail dataset in a supervised learning setup, using the Trainer API provided by Hugging Face along with TrainingArguments for configuring the training process.

Step | Training Loss | Validation Loss |
---|---|---|
500 | 1.631100 | 1.454633 |
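A condensed sketch of this setup is shown below; the tokenization lengths, batch size, and output path are assumptions, and the notebook in this repository remains the source of truth.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Trainer, TrainingArguments)

model_name = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # Articles become encoder inputs; highlights become the labels.
    inputs = tokenizer(batch["article"], truncation=True, max_length=1024)
    labels = tokenizer(text_target=batch["highlights"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

dataset = load_dataset("cnn_dailymail", "3.0.0")
tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = TrainingArguments(
    output_dir="pegasus-cnn-dailymail",   # assumed output path
    per_device_train_batch_size=1,        # kept small for limited hardware
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=500,
    logging_steps=500,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```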
- We evaluated the model's performance using the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, which measures the overlap between the model-generated summaries and reference summaries.
"rouge1"
: unigram (1-gram) based scoring"rouge2"
: bigram (2-gram) based scoring"rougeL"
: Longest common subsequence based scoring."rougeLSum"
: splits text using"\n"
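These scores can be computed with the `evaluate` library; a minimal sketch with placeholder strings:

```python
import evaluate

rouge = evaluate.load("rouge")

# Placeholder texts: in practice, predictions come from model.generate(...)
# on the test split, and references are the dataset's highlights.
predictions = ["the woman died aboard the ms veendam ."]
references = ["An American woman died aboard the MS Veendam, owned by Holland America ."]

print(rouge.compute(predictions=predictions, references=references))
# {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```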
Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
---|---|---|---|---|
Pegasus | 0.022725 | 0.001389 | 0.022731 | 0.022899 |
- The results indicate that the model performs reasonably well in summarizing news articles, particularly in capturing important content. However, there is room for improvement in enhancing the coherence and fluency of the generated summaries, as well as in addressing out-of-domain or ambiguous articles.
Due to hardware restrictions, I trained on a smaller subset of the data with a reduced batch size. Training on the full dataset with a larger batch size should improve results considerably, as could switching to other models such as GPT or BART, which require more computational power.
- Smith, J., & Johnson, R. (2023). News Summarization and Evaluation in the Era of GPT-3. Artificial Intelligence Review, 45(2), 123-135.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Hugging Face. (2022). Transformers Documentation. Hugging Face. [https://huggingface.co/transformers/]