Chris Pollett > Students > Mangesh

    (Print View)

    [Bio]

    [Project Blog]

    [CS297 Proposal]

    [CS297 Presentation-PDF]

    [Text Summarizer based on Intersection method]

    [Text Summarizer based on Centroid method]

    [Text Summarizer based on TF-ISF method]

    [CS297 Report-PDF]

    [CS298 Proposal]

    [CS298 Presentation-PDF]

    [CS298 Report-PDF]

    [Graduation Photo]

                          

























Text Summarizer based on Centroid method

Description: We start with document we wish to summarize. This method gives us a word cloud while generating a summary for that document. For generating word cloud, we are using technique called Topic detection and Tracking which is used in MEAD for finding all the documents with same topic and adding them to a cluster. We first find the centroid of the document i.e. we find the main topic of the document. Then we calculate TF-IDF score of each sentence in document so that we can get the weight of that sentence in a document. Each document is represented as a weighted vector of TF*IDF. After calculating weights, we calculate the cosine similarity between the word cloud i.e. main topic of the document and given sentence from the document. After getting all similarity score between word cloud and each sentence, we keep adding sentences which have the score above threshold in summary until we the limit of the summary length is reached.

Example:

We are going to generate a summary for following sample document:

Cache and Not Carry: Next Mars Rover to Collect Samples for Return to a Earth Someday

Have rover, need payload. Thats the state of things for NASA, which is planning to launch its next rover to Mars in 2020. The rover has ambitious goals, including searching for signs of habitability and life on the Red Planet, and collecting rock samples to be stored for future return to Earth. Now, NASA is asking scientists to propose instruments that will help the spacecraft accomplish its mission.

The space agency released an "announcement of opportunity" on September 24 calling for proposals by December 23. Researchers who plan to put an instrument in the hat must file a headsup about their plans, called a notice of intent, by October 15.

The design of the 2020 rover will hew closely to that of Curiosity, which landed on Mars in August 2012. The new vehicle will have the same basic body, called a chassis, and will use the same "sky crane" landing system to be lowered onto the surface. But the innards of the rover will be all new, featuring a suite of instruments that move beyond what Curiosity can do.

The instruments must accomplish specific goals for the rover set out in a July report by its Science Definition Team, which disbanded after the report was issued. The goals include scouting for habitable locations and looking for possible signs of past life there, such as microbial fossils and concentrations of organic material. The rover will also be tasked with digging up rock core samples and storing them for future retrieval and return to Earth by a future spacecraft, where they can be studied in laboratories with much more sophisticated instruments than anything that can be sent to Mars.

Because sample storage will take up room inside the rover, however, it wont be able to carry instruments for analyzing dugup samples on Mars as Curiosity does. "Curiosity has flown really high end instruments to do its measurements on the surface of Mars," says Jack Mustard of Brown University, who chaired the Science Definition Team. "What this coming rover will do is arguably a better job of finding materials that are interesting. Its somewhat upgraded in its capabilities to do remote measurements. It doesnt try to do any in situ analysis" like Curiosity Sample Analysis at Mars (SAM) and Chemistry and Mineralogy (CheMin) instruments do.

But that decision has angered some Mars scientists, who say the rover will have to sacrifice too much of its instrument space for caching samples. "I think if we are going to have a Curiosity duplicate rover in 2020, it should be loaded with instruments to do in situ science," says Robert Zubrin, cofounder and president of the Mars exploration advocacy nonprofit, The Mars Society. "This one says its going to have 28 kilograms of science instruments. Curiosity has 80 kilograms. They have reduced the science payload by a factor of three in order to have this caching function, which may not have any utility whatsoever." Zubrin says it leaves too much up to chance to have the return of these samples rely on an unspecified mission in the future making a precision rendezvous and landing at the same spot to collect them.

The Science Definition Team members say the 2020 rover will still be able to do significant science, and its important to initiate Mars sample return now. "This mission I think will be on par, in terms of what we learn, with Curiosity, and hold the future prospect of being able to learn 10 times more by bringing samples back to Earth. None of us are going looking for Klingons, but we would be thrilled if we could help find a sample that contains microbes," says Scott Murchie at Johns Hopkins University Applied Physics Laboratory, who was a member of the team.

Wordcloud:

team, collect, analysis, report, carry, learn, science, sample, future, return, going, goals, mission, 2020, instrument, space, called, measurements, spacecraft, scientists

Summary generated by summarizer based on Centroid function:

Now, NASA is asking scientists to propose instruments that will help the spacecraft accomplish its mission.

The instruments must accomplish specific goals for the rover set out in a July report by its Science Definition Team, which disbanded after the report was issued.

But that decision has angered some Mars scientists, who say the rover will have to sacrifice too much of its instrument space for caching samples.

The Science Definition Team members say the 2020 rover will still be able to do significant science, and its important to initiate Mars sample return now.

This mission I think will be on par, in terms of what we learn, with Curiosity, and hold the future prospect of being able to learn 10 times more by bringing samples back to Earth.

None of us are going looking for Klingons, but we would be thrilled if we could help find a sample that contains microbes, says Scott Murchie at Johns Hopkins University Applied Physics Laboratory, who was a member of the team.

Code for slider:

<html>
<head>
<link rel="stylesheet" href="http://code.jquery.com/ui/1.10.3/themes/smoothness/jquery-ui.css" />
  <script src="http://code.jquery.com/jquery-1.9.1.js"></script>
  <script src="http://code.jquery.com/jquery-latest.min.js"></script>
  <script src="http://code.jquery.com/ui/1.10.3/jquery-ui.js"></script>
  <link rel="stylesheet" href="/resources/demos/style.css" />
  <script>
  $(document).ready(function() {
  
  $("#slider").slider({
      min: 100, //minimum value
      max: 1000, //maximum value
      value: 200, //default value
      slide: function(event, ui) {
          $("#value2").val(ui.value);
          }
      });
  $("#value2").val($("#slider").slider("value"));
});
</script>
</head>
<body>

Drag the handle to select the length of desired summary:

	
<form action="centroid.php" method="POST">
<input id="value2" type="text" name="slide" />
<div id="slider"></div>
<input type="submit" />
</form>
</body>
</html>

Code for summarizer:

<!DOCTYPE html>
<html>
<head>

</head>
<body>

<?php
ini_set('memory_limit', '-1');
function microtime_float()
{
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
# Starting time
$time_start = microtime_float();
echo $time_start."<br><br>";
$doc = file_get_contents("sjsu.txt");
echo "<b>Original Content:</b><br>$doc<br>";


# Format the document to remove characters other than periods and alphanumerics.
$formatted_doc = format_doc($doc);

# Remove Stop words
$stop_words = array('a','able','about','above','abst','accordance','according','based','accordingly','across','act',
'actually','added','adj','affected','affecting','affects','after','afterwards','again','against','ah','all','almost',
'along','already','also','although','always','am','among','amongst','an','and','announce','another','any','anybody',
'anyhow','anymore','anyone','anything','anyway','anyways','anywhere','apparently','approximately','are','aren',
'arent','arise','around','as','aside','ask','asking','at','auth','available','away','awfully','b','back','be','became',
'because','become','becomes','becoming','been','before','beforehand','begin','beginning','beginnings','begins',
'behind','being','believe','below','beside','besides','between','beyond','biol','both','brief','briefly','but','by',
'c','ca','came','can','cannot','cant','cause','causes','certain','certainly','co','com','come','comes','contain',
'containing','contains','could','couldnt','d','date','did','didnt','different','do','does','doesnt','doing','done',
'dont','down','downwards','due','during','e','each','ed','edu','effect','eg','eight','eighty','either','else',
'elsewhere','end','ending','enough','especially','et','et-al','etc','even','ever','every','everybody','everyone',
'everything','everywhere','ex','except','f','far','few','ff','fifth','first','five','fix','followed','following',
'follows','for','former','formerly','forth','found','four','from','further','furthermore','g','gave','get','gets',
'getting','give','given','gives','giving','go','goes','gone','got','gotten','h','had','happens','hardly','has',
'hasnt','have','havent','having','he','hed','hence','her','here','hereafter','hereby','herein','heres','hereupon',
'hers','herself','hes','hi','hid','him','himself','his','hither','home','how','howbeit','however','hundred','i',
'id','ie','if','ill','im','immediate','immediately','importance','important','in','inc','indeed','index','information',
'instead','into','invention','inward','is','isnt','it','itd','itll','its','itself','ive','j','just','k','keep',
'keeps','kept','kg','km','know','known','knows','l','largely','last','lately','later','latter','latterly','least',
'lest','let','lets','like','liked','likely','line','little','ll','look','looking','looks','ltd','m','made','less'
,'mainly','make','makes','many','may','maybe','me','mean','means','meantime','meanwhile','merely','mg','might',
'million','miss','ml','more','moreover','most','mostly','mr','mrs','much','mug','must','my','myself','n','na','name',
'namely','nay','nd','near','nearly','necessarily','necessary','need','needs','neither','never','nevertheless',
'new','next','nine','ninety','no','nobody','non','none','nonetheless','noone','nor','normally','nos','not','noted',
'nothing','now','nowhere','o','obtain','obtained','obviously','of','off','often','alone','oh','ok','okay','old',
'omitted','on','once','one','ones','only','onto','or','ord','other','others','otherwise','ought','our','ours',
'out','outside','over','overall','owing','own','p','page','pages','part','particular','particularly','past',
'per','perhaps','placed','please','plus','poorly','possible','possibly','potentially','pp','predominantly','present',
'previously','primarily','probably','promptly','proud','provides','put','q','que','quickly','quite','qv','r',
'ran','rather','rd','re','readily','really','recent','recently','ref','refs','regarding','regardless','regards',
'related','relatively','research','respectively','resulted','resulting','results','right','run','s','said','same',
'saw','say','saying','says','sec','section','see','seeing','seem','seemed','seeming','seems','seen','self','ourselves',
'selves','sent','seven','several','shall','she','shed','shell','shes','should','shouldnt','show','showed','shown',
'showns','shows','significant','significantly','similar','similarly','since','six','slightly','so','some',
'somebody','somehow','someone','somethan','something','sometime','sometimes','somewhat','somewhere','soon','sorry',
'specifically','specified','specify','specifying','still','stop','strongly','sub','substantially','successfully',
'such','sufficiently','suggest','sup','sure','t','take','taken','taking','tell','tends','th','than','thank','thanks',
'thanx','that','thatll','thats','thatve','the','their','theirs','them','themselves','then','thence','there',
'thereafter','thereby','thered','therefore','therein','therell','thereof','therere','theres','thereto','thereupon',
'thereve','these','they','theyd','theyll','theyre','theyve','think','this','those','thou','though','thoughh',
'thousand','throug','through','throughout','thru','thus','til','tip','to','together','too','took','toward','towards',
'tried','tries','truly','try','trying','ts','twice','two','u','un','under','unfortunately','unless','unlike',
'unlikely','until','unto','up','upon','ups','us','use','used','useful','usefully','usefulness','uses','using',
'usually','v','value','various','ve','very','via','viz','vol','vols','vs','w','want','wants','was','wasnt','way','we',
'wed','welcome','well','went','were','werent','weve','what','whatever','whatll','whats','when','whence','whenever',
'where','whereafter','whereas','whereby','wherein','wheres','whereupon','wherever','whether','which','while',
'whim','whither','who','whod','whoever','whole','wholl','whom','whomever','whos','whose','why','widely','willing',
'wish','with','within','without','wont','words','world','would','wouldnt','www','x','y','yes','yet','you','youd',
'youll','your','youre','yours','yourself','yourselves','youve','z','zero');
$doc_stop =preg_replace('/\b('.implode('|',$stop_words).')\b/','',strtolower($doc));

#####Time to remove stop words
$time1 = microtime_float();
echo "<br><br>Time to remove stop words: ",$time1-$time_start;

# Spliting into sentences
$sentences = convtosentences($doc);

#####Time to split into sentences
$time2 = microtime_float();
echo "<br><br>Time to split into sentences: ",$time2-$time1;

$n = count($sentences);

# Spliting into terms
$doc_st = format_sent($doc_stop);

$term=preg_split("/\s/",$doc_st,-1,PREG_SPLIT_NO_EMPTY);

#####Time to split into terms
$time3 = microtime_float();
echo "<br><br>Time to split into terms: ",$time3-$time2;


$terms = array_unique($term);
sort($terms);
$t = count($terms);
for($i=0;$i<$t;$i++)
	$terms[$i] = preg_replace('/[\-]+/','', $terms[$i]);

# Initialize term-frequecy array
$tf=array();
for($i=0;$i<$n;$i++)
{
	for($j=0;$j<$t;$j++)
	{
		$tf[$i][$j] = 0;
		
	}
}

# Initialize Nk array(Number of Documents the term occurs)
for($j=0;$j<$t;$j++)
{
	$nk[$j] = 0;
}

# Count TF for each word

for($i=0;$i<$n;$i++)
{
	for($j=0;$j<$t;$j++)
	{
		$nt = preg_match_all('/\b'.preg_quote($terms[$j]).'\b/', $sentences[$i],$matches);
		$tf[$i][$j] = 1+ log($nt);
		if($nt!=0)
			$nk[$j]++;
		else
			$tf[$i][$j] = 0;
	}
}

#####Time to calculate TF
$time4 = microtime_float();
echo "<br><br>Time to calculate TF: ",$time4-$time3;

# Calculate weights of each term for every sentence
$w=array();
$idf = array();;
$idf_temp = 0;
for($i=0;$i<$n;$i++)
{
	for($k=0;$k<$t;$k++)
	{
		$idf_temp = @($n/$nk[$k]);
		if($nk[$k]==0)
			$idf_temp=0;
		$idf[$k] = @log($idf_temp);
		if(is_nan($idf[$k]))
			$idf[$k]=0;
		if(is_infinite($idf[$k]))
			$idf[$k]=0;
		$w[$i][$k] = $tf[$i][$k]*$idf[$k];
		if(is_nan($w[$i][$k]))
			$w[$i][$k]=0;
	}
}

#####Time to calculate 
$time5 = microtime_float();
echo "<br><br>Time to calculate IDF: ",$time5-$time4;

# Count TF for finding centroid
$doc_centroid = preg_replace('/[\.]+/', ' ', $formatted_doc);
$wc = array();
for($j=0;$j<$t;$j++)
	{
	
		$nt = preg_match_all('/\b'.preg_quote($terms[$j]).'\b/', $doc_centroid,$matche);
		$tfc[$j] = 1+ log($nt);
		$wc[$j] = $tfc[$j]*$idf[$j];
		if(is_nan($wc[$j])||is_infinite($wc[$j]))
			$wc[$j]=0;
	}


# Calculate centroid

arsort($wc);
$centroid = array();
$centroid = array_slice($wc,0,20,true);

# Initializing centroid weight array by 0
for($i=0;$i<$t;$i++)
{
	$wc[$i] =0;
	
}


# Trying out word cloud
echo "<br><br><b>Word Cloud:</b><br>";
foreach($centroid as $key=>$value)
{
	$value = $value*2;
	echo "<b style='font-size: ".$value."px;'>".$terms[$key].", "."</b>";
	$wc[$key]=$value;
}

#####Time to create word cloud
$time6 = microtime_float();
echo "<br><br>Time to create word cloud: ",$time6-$time5;


# Centroid vector
ksort($wc);


# Calculate similarity measure between centroid and each sentence
$sim=array();
for($i=0;$i<$n;$i++)
{
	$a=$b1=$b2=$c1=$c2=$d = 0;
	for($k=0;$k<$t;$k++)
	{
			$a += ($w[$i][$k]*$wc[$k]*$idf[$k]);
			$b1 += ($w[$i][$k]*$w[$i][$k]);
			$c1 += ($wc[$k]*$wc[$k]);
			
	}
	$b2 = sqrt($b1);
	$c2 = sqrt($c1);
	$d = $b2*$c2;
	$sim[$i] = @($a/$d);
	if($d==0)
		$sim[$i]=0;
}

#####Time to calculate similarity score for each doc
$time7 = microtime_float();
echo "<br><br>Time to calculate similarity score for each doc: ",$time7-$time6;


echo "<br><br><b>Similarity score of each sentence with centroid:</b><br> ";
print_r($sim);

echo "<br><br><b>Summary Sorted Array:</b><br>";
arsort($sim);

# Getting how many sentences should be there in summary
$top = sent_summary($sentences,$sim);
print_r ($sim);
$sum_array = array();
$sum_array = array_slice($sim,0,$top-1,true);
echo "<br><br><b>Summary Array:</b><br>";
ksort($sum_array);
print_r($sum_array);

#Printing Summary

$summary='';
foreach($sum_array as $key=>$value)
{
	$summary .= "$sentences[$key]".". ";
}

	
# Summary of text summarization
$words = explode(" ",$doc);
$sum_words = explode(" ",$summary);
echo "<br><br><b>Summary :</b><br> $summary";

echo "<br><br>Original Length:"." ". count($words)."<br>";
echo "Summary Length:"." ". count($sum_words)."<br>";
echo "Summary Length(in characters):"." ". strlen($summary)."<br>";
echo "Summary Ratio: ".(100 - (100 * count($sum_words) / (count($words))));

# Finishing time
$time_stop = microtime_float();
echo "<br><br>".$time_stop;
echo "<br>Duration:". ($time_stop-$time_start);

function sent_summary($sentences,$sim)
{
	$top=null;
	$count = 0;
	foreach($sim as $key=>$value)
	{
		if($count<$_POST["slide"]) 
		{
			$count += strlen($sentences[$key]);
			$top++;
		}
	}
	return	$top;
}


function convtosentences($content)
{
	$content = preg_split("/\.\s|[\n\r]+/",$content,-1,PREG_SPLIT_NO_EMPTY);
	return $content;
}	
	
function format_sent($sen)
{
	$sen = trim(preg_replace('/[^a-z0-9\s\-]+/', '', strtolower($sen)));
	return $sen;
}

function format_doc($content)
{
	$content = preg_replace('/[\n\r\-]+/', ' ', strtolower($content));
	$content = trim(preg_replace('/[^a-z0-9\s\.]+/', '', $content));
	return $content;
}



?>
</body>
</html>