How to Summarize Text Using Python NLP and Extractive Text Summarization¶
In this tutorial, learn how Python text summarization works by exploring and comparing three classic extractive algorithms: Luhn's algorithm,[1] LexRank,[2] and Latent Semantic Analysis (LSA).[3]
While modern transformer model architectures based on neural networks dominate many NLP tasks, this tutorial focuses on classical approaches that remain valuable in data science workflows where interpretability, limited dependencies, and predictable summary length matter. These methods are often used to automate the generation of concise summaries from a large corpus without requiring a labeled dataset.
By the end of this tutorial you'll understand:
- How frequency-based, graph-based, and semantic summarization algorithms work.
- The strengths and limitations of each approach.
- How to implement these algorithms in Python using the Sumy library.
- When to choose extractive versus abstractive text summarization for your projects.
Extractive vs abstractive summarization¶
Text summarization can be broadly categorized into two approaches:
Extractive summarization selects and combines existing sentences directly from the source text to create a summary. Think of it like highlighting the most important sentences in a document. This tutorial focuses on extractive methods, which dominated the field for decades and remain valuable for their interpretability and reliability.
Abstractive summarization generates new sentences to convey the original meaning, similar to how we might paraphrase or rewrite key points. Modern large language models (LLMs) like GPT, Granite, and Claude excel at this approach, although extractive methods still offer advantages in transparency and computational efficiency. Abstractive systems typically rely on transformers pre-trained on large-scale data, sometimes fine-tuned for specific summarization tasks or machine translation objectives.
The evolution of extractive summarization¶
Automatic text summarization began in 1958 with Hans Peter Luhn, an IBM researcher who published "The Automatic Creation of Literature and Abstracts." Luhn's algorithm was groundbreaking in its simplicity: determine sentence importance by counting the frequency of meaningful words. Though basic by today's standards, this frequency-based approach established the foundation for subsequent work in the field.
Luhn's statistical method had clear limitations: it couldn't capture semantic relationships, context, or nuance in language. Over the following decades, researchers expanded on his work by incorporating:
- Graph-based methods like LexRank, which identify important sentences by analyzing similarity patterns across the entire document.
- Semantic approaches like LSA, which uncover hidden thematic structures using linear algebra to understand meaning beyond surface-level word matching.
Understanding these algorithms illuminates fundamental concepts in information retrieval (IR) and natural language processing (NLP), while illustrating the field's evolution from simple rule-based systems to the sophisticated deep-learning models we use today. Today, these models are commonly accessed through platforms like Hugging Face, exposed via an API, and powered by frameworks such as PyTorch.
The following section provides a step-by-step walkthrough for implementing classic extractive text summarization algorithms in Python.
Steps¶
Step 1. Clone the GitHub repository¶
To run this project, clone the GitHub repository by using https://github.com/IBM/ibmdotcom-tutorials.git as the HTTPS URL (for example, git clone https://github.com/IBM/ibmdotcom-tutorials.git). For detailed steps on how to clone a repository, refer to the GitHub documentation.
You can find this specific tutorial inside the ibmdotcom-tutorials repo under the generative AI directory.
Step 2. Set up your environment¶
This tutorial uses a Jupyter Notebook to demonstrate text summarization with Python using Sumy, a lightweight Python library, rather than a large-scale artificial intelligence system. Jupyter Notebooks are versatile tools that allow you to combine code, text, and visualization in a single environment. You can run this notebook in your local IDE or explore cloud-based options like watsonx.ai Runtime, which provides a managed environment for running Jupyter Notebooks.
Whether you choose to run the notebook locally or in the cloud, the steps and code remain the same. Simply ensure that the required Python libraries are installed in your environment.
Step 3. Install and import¶
The following Python code installs the required packages and prepares the environment for running extractive summarization techniques.
%pip install sumy
%pip install lxml_html_clean
%pip install requests beautifulsoup4
%pip install numpy
import requests # Import requests library
from bs4 import BeautifulSoup # Add BeautifulSoup for HTML parsing
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer # Import LuhnSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer # Import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer # Import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk
nltk.download('punkt_tab')
Summarize texts with different algorithms¶
Luhn¶
As previously mentioned, the Luhn algorithm is a statistical, frequency-based approach to extractive summarization. It works on the premise that the most important sentences in a document are those that contain the most significant words, where significant words are those that occur frequently, but not too frequently. This makes Luhn particularly effective for quickly extracting salient sentences without any semantic modeling.
Luhn algorithm workflow
- Preprocessing: Common words, called stop words, that appear frequently in language (like "a", "this", "is", "the") but have little meaningful information on their own are filtered out of the text data. A technique called stemming is applied to reduce words to their root forms ("running", "runs", "ran" -> "run").
- Word scoring: The frequency of each word is calculated. Words that appear with moderate to high frequency are considered significant, while extremely common words (including stop words) and very rare words are given less weight.
- Sentence scoring: Each sentence is scored based on the clusters of significant words it contains. A cluster's score reflects the density of significant words within it, commonly computed as the square of the number of significant words in the cluster divided by the cluster's length in words (see the scoring sketch that follows this list).
- Summary generation: The top-scoring sentences are selected and presented in their original order to create a summary.
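To make the workflow concrete, here is a minimal, from-scratch sketch of Luhn-style scoring. It is illustrative only, not Sumy's implementation: the tiny stop word list, the crude sentence splitting on periods, the significance threshold (min_freq), and the maximum cluster gap of 4 words are all assumptions made for this example.
# Minimal, illustrative Luhn-style scorer (not Sumy's implementation)
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "that", "from"}

def luhn_sentence_score(sentence_words, significant_words, max_gap=4):
    # Positions of significant words in the sentence.
    positions = [i for i, w in enumerate(sentence_words) if w in significant_words]
    if not positions:
        return 0.0
    # Group significant words into clusters separated by at most max_gap
    # insignificant words; score each cluster as
    # (significant words in cluster)^2 / cluster length, and keep the best.
    best, start, prev, count = 0.0, positions[0], positions[0], 1
    for pos in positions[1:]:
        if pos - prev <= max_gap:
            count += 1
        else:
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))

def luhn_sketch(text, sentence_count=2, min_freq=2):
    # Crude sentence splitting and tokenization, for illustration only.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tokenized = [[w.lower().strip(",;:()") for w in s.split()] for s in sentences]
    # Word scoring: non-stop words appearing at least min_freq times are significant.
    freqs = Counter(w for words in tokenized for w in words if w and w not in STOP_WORDS)
    significant = {w for w, f in freqs.items() if f >= min_freq}
    # Sentence scoring and selection, returned in original document order.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: luhn_sentence_score(tokenized[i], significant),
                    reverse=True)[:sentence_count]
    return [sentences[i] for i in sorted(ranked)]

example = ("Text summarization condenses text. Summarization methods score sentences. "
           "Extractive summarization selects sentences from the text.")
print(luhn_sketch(example, 2))
Sumy's LuhnSummarizer implements the same idea with proper sentence tokenization, stemming, and a full stop word list.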
Try it out yourself by running the following codeblock:
# Luhn Extractive Summarization Example
def luhn_summarize(input_data, sentence_count=2, input_type="text"):
    """
    Summarize text using the Luhn algorithm.

    Args:
        input_data (str): The input text, URL, or file path.
        sentence_count (int): Number of sentences for the summary.
        input_type (str): Type of input - "text", "url", or "file".

    Returns:
        list: Summary sentences.
    """
    if input_type == "url":
        response = requests.get(input_data)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')  # Parse HTML content
        text = soup.get_text()  # Extract plain text
    elif input_type == "file":
        with open(input_data, 'r', encoding='utf-8') as file:
            text = file.read()
    else:
        text = input_data

    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize summarizer with stemmer
    summarizer = LuhnSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary

# Test with sample text
sample_text = """
Text summarization is an important area of natural language processing (NLP) that focuses on condensing large amounts of text into shorter, coherent summaries.
Modern approaches can identify the main ideas in a document and present them with minimal human involvement.
Extractive methods select representative sentences directly from the source text, while abstractive methods generate new phrasing based on the original meaning.
These techniques are increasingly used in information retrieval, research analysis, and other applications where quick understanding of text is essential.
"""

# Summarize plain text
summary = luhn_summarize(sample_text, 2, input_type="text")
print("Summary from text:")
for sentence in summary:
    print(sentence)

# Summarize from a URL
url = "https://www.ibm.com/think/topics/natural-language-processing"
summary = luhn_summarize(url, 2, input_type="url")
print("\nSummary from URL:")
for sentence in summary:
    print(sentence)
Example Luhn algorithm summarization¶
Below is an example of the expected output (you may get different summarization results depending on factors like library versions, input formatting, and tokenization):
Summary from text:
Text summarization is an important area of natural language processing (NLP) that focuses on condensing large amounts of text into shorter, coherent summaries.
Extractive methods select representative sentences directly from the source text, while abstractive methods generate new phrasing based on the original meaning.
Summary from URL:
NLP enables computers and digital devices to recognize, understand and generate text and speech by combining computational linguistics, the rule-based modeling of human language together with statistical modeling, machine learning and deep learning.
In document processing, NLP tools can automatically classify, extract key information and summarize content, reducing the time and errors associated with manual data handling.
LexRank¶
LexRank is an extractive summarization algorithm that applies graph-based ranking to text summarization, focusing on sentence centrality: it ranks each sentence by how similar it is to the other sentences in the document.
LexRank algorithm workflow
- Generate a similarity graph: Each sentence is represented as a node in a graph. The similarity between every pair of sentences is calculated (typically using cosine similarity on TF-IDF vectors), and sentences are connected by edges weighted with these similarity scores.
- Compute sentence centrality: Importance scores are calculated for each sentence using an iterative voting process inspired by Google's PageRank algorithm. Each sentence begins with an equal score. In each iteration, a sentence's score is updated based on the scores of the sentences it is connected to, so sentences that are similar to many highly scored sentences receive higher scores themselves. This process repeats until the scores stabilize, with a damping factor ensuring convergence. The result is a reinforcement effect in which sentences discussing central themes naturally accumulate higher scores (a simplified sketch of this process follows the list).
- Select top sentences: The highest-scoring sentences are extracted to form a summary, presented in their original document order.
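Here is a simplified sketch of that process with NumPy. It uses raw term-frequency sentence vectors instead of TF-IDF for brevity, and the tokenizer, the 0.85 damping factor, and the fixed number of iterations are assumptions made for this example; it is not Sumy's implementation.
# Simplified LexRank-style centrality with NumPy (illustrative, not Sumy's code)
import numpy as np
from collections import Counter

def lexrank_sketch(sentences, top_n=2, damping=0.85, iterations=50):
    # Bag-of-words term-frequency vectors over a shared vocabulary.
    tokenized = [[w.lower().strip(".,;:()") for w in s.split()] for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(sentences), len(vocab)))
    for row, words in enumerate(tokenized):
        for w, c in Counter(words).items():
            tf[row, index[w]] = c
    # Cosine similarity between every pair of sentence vectors (graph edges).
    norms = np.linalg.norm(tf, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    sim = (tf / norms) @ (tf / norms).T
    np.fill_diagonal(sim, 0.0)
    # Row-normalize into a transition matrix, then run the damped,
    # PageRank-style voting described above until the scores settle.
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    transition = sim / row_sums
    n = len(sentences)
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * (transition.T @ scores)
    # Return the top-scoring sentences in their original order.
    top = sorted(np.argsort(scores)[::-1][:top_n])
    return [sentences[i] for i in top]

# Assuming the earlier cells have run, reuse the sample_text defined above.
print(lexrank_sketch([s.strip() for s in sample_text.split(".") if s.strip()], 2))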
Try LexRank summarization below:
# LexRank Extractive Summarization Example
def lexrank_summarize(input_data, sentence_count=2, input_type="text"):
    """
    Summarize text using the LexRank algorithm.

    Args:
        input_data (str): The input text or URL to summarize.
        sentence_count (int): Number of sentences for the summary.
        input_type (str): Type of input - "text" or "url".

    Returns:
        list: Summary sentences.
    """
    if input_type == "url":
        response = requests.get(input_data)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')  # Parse HTML content
        text = soup.get_text(separator=' ')  # Extract plain text
    else:
        text = input_data

    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize LexRank summarizer with stemmer
    summarizer = LexRankSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary

# Test with sample text
sample_text = """
Text summarization is an important area of natural language processing (NLP) that focuses on condensing large amounts of text into shorter, coherent summaries.
Modern approaches can identify the main ideas in a document and present them with minimal human involvement.
Extractive methods select representative sentences directly from the source text, while abstractive methods generate new phrasing based on the original meaning.
These techniques are increasingly used in information retrieval, research analysis, and other applications where quick understanding of text is essential.
"""

# Summarize plain text
summary = lexrank_summarize(sample_text, 2, input_type="text")
print("Summary from text:")
for sentence in summary:
    print(sentence)

# Summarize from a URL
url = "https://www.ibm.com/think/topics/natural-language-processing"
summary = lexrank_summarize(url, 2, input_type="url")
print("\nSummary from URL:")
for sentence in summary:
    print(sentence)
Example LexRank algorithm summarization¶
Here are the example summarization results using LexRank:
Summary from text:
Text summarization is an important area of natural language processing (NLP) that focuses on condensing large amounts of text into shorter, coherent summaries.
Modern approaches can identify the main ideas in a document and present them with minimal human involvement.
Summary from URL:
What is NLP?
AI models
You might notice that the URL summary is not as complete as the previous algorithm. Sometimes, LexRank produces very short or unusual summaries when summarizing content from a URL. This happens because LexRank relies on comparing sentence similarity, and webpages often contain short headings or fragmented text that don’t provide enough context for the algorithm to rank sentences meaningfully. In contrast, Luhn looks at word frequency, so it can still pick out the most important sentences even in sparse or messy text. This illustrates that while LexRank is powerful for well-structured documents, it’s not always the best choice for web scraping or heading-heavy content.
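One practical mitigation is to drop very short, heading-like fragments before passing webpage text to the summarizer. The sketch below reuses the lexrank_summarize function and url variable defined above; the minimum length of 6 words is an arbitrary threshold chosen for illustration.
# Optional: filter out short, heading-like fragments before summarizing web text
def summarize_webpage_filtered(page_url, sentence_count=2, min_words=6):
    response = requests.get(page_url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Keep only text chunks long enough to look like full sentences.
    chunks = [chunk.strip() for chunk in soup.get_text(separator='\n').split('\n')]
    text = ' '.join(chunk for chunk in chunks if len(chunk.split()) >= min_words)
    return lexrank_summarize(text, sentence_count, input_type="text")

print("\nFiltered summary from URL:")
for sentence in summarize_webpage_filtered(url, 2):
    print(sentence)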
LSA¶
Latent semantic analysis (LSA) is a technique that extracts hidden conceptual meaning from a text. LSA identifies the core concepts in a document and selects sentences that best represent those concepts.
LSA algorithm workflow
- Create a term-sentence matrix: The algorithm begins by building a matrix where rows represent words, columns represent sentences, and each cell contains a word's frequency (or TF-IDF weight) in that sentence. In practice, the decomposition in the next step is truncated to the strongest concepts to control size and reduce noise.
- Apply singular value decomposition (SVD): Decompose this matrix into three matrices to capture the underlying semantic structure: $U$ (words to topics), $\Sigma$ (relative topic strength), $V^{T}$ (topics to sentences). SVD identifies the most important "topics" or "concepts" in the document by finding patterns in how words co-occur across sentences.
- Score sentences: Each sentence is scored by how strongly it expresses the most important concepts identified by SVD. Sentences with strong representation across the top concepts receive higher scores (see the sketch after this list).
- Generate summary: Top-scoring sentences are selected and presented in their original order.
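As a minimal illustration of these steps, the sketch below builds a small term-sentence matrix, applies NumPy's SVD, and scores each sentence by the length of its weighted concept vector. That scoring rule and the choice of two concepts are assumptions made for this example; Sumy's LsaSummarizer differs in its details.
# Simplified LSA-style summarization with NumPy SVD (illustrative only)
import numpy as np
from collections import Counter

def lsa_sketch(sentences, top_n=2, concepts=2):
    # Step 1: term-sentence matrix (rows = words, columns = sentences).
    tokenized = [[w.lower().strip(".,;:()") for w in s.split()] for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    matrix = np.zeros((len(vocab), len(sentences)))
    for col, words in enumerate(tokenized):
        for w, count in Counter(words).items():
            matrix[vocab.index(w), col] = count
    # Step 2: SVD, truncated to the strongest concepts.
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(concepts, len(S))
    # Step 3: score each sentence by its weighted presence across the top concepts.
    scores = np.sqrt(((S[:k, None] * Vt[:k, :]) ** 2).sum(axis=0))
    # Step 4: return the top-scoring sentences in their original order.
    top = sorted(np.argsort(scores)[::-1][:top_n])
    return [sentences[i] for i in top]

# Assuming the earlier cells have run, reuse the sample_text defined above.
print(lsa_sketch([s.strip() for s in sample_text.split(".") if s.strip()], 2))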
Try LSA out yourself below:
# LSA Extractive Summarization Example
def lsa_summarize(input_data, sentence_count=2, input_type="text"):
    """
    Summarize text using the LSA algorithm.

    Args:
        input_data (str): The input text or URL to summarize.
        sentence_count (int): Number of sentences for the summary.
        input_type (str): Type of input - "text" or "url".

    Returns:
        list: Summary sentences.
    """
    if input_type == "url":
        response = requests.get(input_data)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')  # Parse HTML content
        text = soup.get_text(separator=' ')  # Extract plain text
    else:
        text = input_data

    # Parse the input text
    parser = PlaintextParser.from_string(text, Tokenizer("english"))

    # Initialize LSA summarizer with stemmer
    summarizer = LsaSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    # Generate summary
    summary = summarizer(parser.document, sentence_count)
    return summary

# Test with sample text
sample_text = """
Text summarization is an important area of natural language processing (NLP) that focuses on condensing large amounts of text into shorter, coherent summaries.
Modern approaches can identify the main ideas in a document and present them with minimal human involvement.
Extractive methods select representative sentences directly from the source text, while abstractive methods generate new phrasing based on the original meaning.
These techniques are increasingly used in information retrieval, research analysis, and other applications where quick understanding of text is essential.
"""

# Summarize plain text
summary = lsa_summarize(sample_text, 2, input_type="text")
print("Summary from text:")
for sentence in summary:
    print(sentence)

# Summarize from a URL
url = "https://www.ibm.com/think/topics/natural-language-processing"
summary = lsa_summarize(url, 2, input_type="url")
print("\nSummary from URL:")
for sentence in summary:
    print(sentence)
Example LSA algorithm summarization¶
Below are example summarization results using LSA:
Summary from text:
Modern approaches can identify the main ideas in a document and present them with minimal human involvement.
These techniques are increasingly used in information retrieval, research analysis, and other applications where quick understanding of text is essential.
Summary from URL:
NLP is already part of everyday life for many, powering search engines, prompting chatbots for customer service with spoken commands, voice-operated GPS systems and question-answering digital assistants on smartphones such as Amazon’s Alexa, Apple’s Siri and Microsoft’s Cortana.
But NLP solutions can become confused if spoken input is in an obscure dialect, mumbled, too full of slang, homonyms, incorrect grammar, idioms, fragments, mispronunciations, contractions or recorded with too much background noise.
LSA differs from Luhn and LexRank because it focuses on the underlying concepts or topics in a text rather than just word frequency or sentence similarity. Luhn is great when you want a broad summary based on important keywords, and LexRank works well for well-structured text where sentence relationships matter. LSA, however, is ideal when you want a coherent, concept-focused summary, especially for longer documents with multiple paragraphs, because it can highlight the main ideas without getting distracted by repeated keywords or short headings. In short, choose LSA when understanding the key themes is more important than capturing every high-frequency term.
Conclusion¶
In this tutorial, you explored three classic extractive summarization algorithms—Luhn, LexRank, and LSA—and learned how they approach the task in different ways. Luhn focuses on word frequency, LexRank uses sentence similarity, and LSA identifies underlying concepts to select the most meaningful sentences. Each method has its strengths: Luhn works well for general keyword-based summaries, LexRank is effective for structured text with clear sentence relationships, and LSA shines when you want a coherent, concept-focused overview of longer documents. Understanding these approaches gives you the foundation to choose the right extractive summarization technique for your projects and shows how the field has evolved from simple rule-based methods to sophisticated semantic analysis. These classic approaches also serve as strong baselines when evaluating modern abstractive systems or designing hybrid pipelines for real-world use cases.
Footnotes¶
[1] Luhn, Hans Peter. "The automatic creation of literature abstracts." IBM Journal of Research and Development 2, no. 2 (1958): 159-165.
[2] Erkan, Günes, and Dragomir R. Radev. "LexRank: Graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research 22 (2004): 457-479.
[3] Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. "Indexing by latent semantic analysis." Journal of the American Society for Information Science 41, no. 6 (1990): 391-407.