Centroid-Based Text Summarization in Python
June 2, 2020, 2:29 p.m. | Text Mining
Introduction
Centroid-based summarization is an extractive summarization technique: it extracts important sentences from one or more documents to form a summary. It works by identifying the most central sentences across multiple documents, that is, the sentences that give the necessary and sufficient amount of information about the main theme of the document(s). A common way of identifying the central sentences is to represent the sentences in a vector space.
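To make the vector-space idea concrete, here is a minimal sketch, assuming a plain bag-of-words representation over whitespace-split, lowercased tokens. The helper names `build_vocabulary` and `sentence_vector` are illustrative only and are not part of the implementation in this post.

```python
def build_vocabulary(sentences):
    """Map each distinct lowercase word to a vector index."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def sentence_vector(sentence, vocab):
    """Return a term-count vector for one sentence."""
    vector = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector


sentences = ["the cat sat", "the dog sat down"]
vocab = build_vocabulary(sentences)
vectors = [sentence_vector(s, vocab) for s in sentences]
```

Each sentence becomes a list of word counts with one position per vocabulary word, so sentences that share words have overlapping non-zero positions.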
The centrality of a sentence is defined in terms of the centrality of its words, and TF X IDF scores are used to measure the centrality of words. The words whose TF X IDF scores are above a predefined threshold form the centroid of the cluster.
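The following is a small sketch of how the centroid of a cluster could be built from TF X IDF scores. It assumes the documents are already in memory as strings, counts TF over the whole cluster, and uses an unsmoothed IDF of log10(N / df); the function names, the sample documents, and the threshold value are all illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Return {word: TF x IDF}, with TF counted over the whole cluster."""
    tokenized = [doc.lower().split() for doc in documents]
    tf = Counter(word for tokens in tokenized for word in tokens)
    df = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(documents)
    return {word: tf[word] * math.log10(n_docs / df[word]) for word in tf}


def build_centroid(scores, threshold):
    """Keep only the words whose score is above the threshold."""
    return {word: score for word, score in scores.items() if score > threshold}


documents = [
    "solar power plants use solar panels",
    "wind power plants use turbines",
    "power plants feed the grid",
]
scores = tf_idf_scores(documents)
centroid = build_centroid(scores, threshold=0.5)
```

In this toy cluster, "power" appears in every document, so its IDF and therefore its score is 0, while "solar" appears twice in a single document and is the only word that clears the threshold.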
In centroid-based summarization, the sentences that contain more words from the centroid of the cluster are considered central sentences. Those sentences are then output as the summary of the documents.
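As a rough sketch of this selection step, assume a centroid is available as a dictionary mapping words to scores. The names `score_sentences` and `select_central`, the centroid values, and the cut-off `min_score` are hypothetical and chosen only for illustration.

```python
def score_sentences(sentences, centroid):
    """Sum the centroid values of the words in each sentence."""
    return [
        sum(centroid.get(word, 0.0) for word in sentence.lower().split())
        for sentence in sentences
    ]


def select_central(sentences, scores, min_score):
    """Return the sentences scoring above min_score, in original order."""
    return [s for s, score in zip(sentences, scores) if score > min_score]


# Hypothetical centroid values, chosen only for illustration.
centroid = {"solar": 0.95, "panels": 0.8}
sentences = [
    "solar panels convert light",
    "the grid needs storage",
    "solar output varies",
]
scores = score_sentences(sentences, centroid)
summary = select_central(sentences, scores, min_score=0.9)
```

Words outside the centroid contribute nothing, so a sentence's score depends only on how many centroid words it contains and how heavily they are weighted.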
Centroid-based summarization was introduced by Radev, Blair-Goldensohn, and Zhang.
Implementation of Text Summarization in Python
from os import listdir
import math

# Calculate the inverse document frequency (IDF) score of a word
def calculate_idf(word):
    files = listdir("E:\\Works\\Products\\doc")  # Specify the directory where the documents are located
    # Starting at 2 and 1 avoids division by zero and keeps the logarithm positive
    count, wcount = 2, 1
    for file1 in files:
        # Specify the directory where the documents are located
        with open("E:\\Works\\Products\\doc\\" + file1, 'r') as file:
            page = file.read()
        # Substring test: counts the document if the word occurs anywhere in its text
        if word in page:
            wcount += 1
        count += 1
    idf = count / wcount
    return math.log(idf, 10)

# Calculate the centroid score of each sentence
def calculate_centroid(sentences):
    # Compute the TF X IDF score of each word: adding the IDF once per
    # occurrence is the same as multiplying it by the term frequency
    tfidf = dict()
    for sentence in sentences:
        for word in sentence.split():
            if word in tfidf:
                tfidf[word] += calculate_idf(word)
            else:
                tfidf[word] = calculate_idf(word)
    # Construct the centroid of the cluster from the words whose score
    # is above the threshold; every other word contributes 0
    centroid = dict()
    threshold = 0.7
    for word in tfidf:
        if tfidf[word] > threshold:
            centroid[word] = tfidf[word]
        else:
            centroid[word] = 0
    # Score each sentence by summing the centroid values of its words
    sentence_score = list()
    for sentence in sentences:
        score = 0
        for word in sentence.split():
            score += centroid[word]
        sentence_score.append(score)
    return sentence_score

# Read the documents and split the combined text into sentences
files = listdir("E:\\Works\\Products\\doc")  # Specify the directory where the documents are located
page = ""
for file1 in files:
    file = open("E:\\Works\\Products\\doc\\" + file1, 'r')
    page += file.read()
    file.close()
sentences = page.split(".")
sentence_score = calculate_centroid(sentences)
# Print the sentences that contain the most central words
for i in range(len(sentences)):
    if sentence_score[i] > 15:
        print(sentences[i])
Implementation Details
I have implemented centroid-based text summarization in Python. The code above covers only the summarization itself: it contains one method that calculates the IDF score of every word and one that calculates the centroid score of every sentence. For Python basics, refer to Python Programming Examples.
The centroid score of every sentence is calculated from the TF X IDF scores of the words in the sentence. To keep the algorithm easy to follow, preprocessing techniques such as stop word removal and stemming are not included in the above implementation.
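For readers who want to add preprocessing, here is a minimal sketch of stop word removal. The `STOP_WORDS` set is a tiny hand-picked list and the `preprocess` function is a hypothetical helper, neither of which comes from the implementation above. Stemming is left out of the sketch, since it would normally rely on a dedicated stemmer.

```python
import string

# A tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def preprocess(sentence):
    """Lowercase, strip punctuation, and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = sentence.lower().translate(table).split()
    return [word for word in words if word not in STOP_WORDS]


tokens = preprocess("The centroid is the core of the cluster.")
```

A function like this could be applied to each sentence before the TF X IDF scores are computed, so that very common words do not take part in the scoring.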