Chris Pollett > Students > Mangesh

    (Print View)

    [Bio]

    [Project Blog]

    [CS297 Proposal]

    [CS297 Presentation-PDF]

    [Text Summarizer based on Intersection method]

    [Text Summarizer based on Centroid method]

    [Text Summarizer based on TF-ISF method]

    [CS297 Report-PDF]

    [CS298 Proposal]

    [CS298 Presentation-PDF]

    [CS298 Report-PDF]

    [Graduation Photo]

                          

























Text Summarizer based on TF-ISF method

Description: In this method, we will be representing document as a weighted vector of TF*IDF as we did in centroid method. After that, we calculate the cosine similarity of each sentence with every other sentence from the document. With cosine similarity scores, we also calculate coverage and diversity of the summary. We enforce coverage and diversity to make the summary more informative and concise by ensuring that it is covering all the topics from the document and removing redundant information from the summary. For enforcing these two functions, we created a single objective function and calculated it using a simple genetic algorithm.

Example:

We are going to generate a summary for following sample document:

Cache and Not Carry: Next Mars Rover to Collect Samples for Return to a Earth Someday

Have rover, need payload. Thats the state of things for NASA, which is planning to launch its next rover to Mars in 2020. The rover has ambitious goals, including searching for signs of habitability and life on the Red Planet, and collecting rock samples to be stored for future return to Earth. Now, NASA is asking scientists to propose instruments that will help the spacecraft accomplish its mission.

The space agency released an "announcement of opportunity" on September 24 calling for proposals by December 23. Researchers who plan to put an instrument in the hat must file a headsup about their plans, called a notice of intent, by October 15.

The design of the 2020 rover will hew closely to that of Curiosity, which landed on Mars in August 2012. The new vehicle will have the same basic body, called a chassis, and will use the same "sky crane" landing system to be lowered onto the surface. But the innards of the rover will be all new, featuring a suite of instruments that move beyond what Curiosity can do.

The instruments must accomplish specific goals for the rover set out in a July report by its Science Definition Team, which disbanded after the report was issued. The goals include scouting for habitable locations and looking for possible signs of past life there, such as microbial fossils and concentrations of organic material. The rover will also be tasked with digging up rock core samples and storing them for future retrieval and return to Earth by a future spacecraft, where they can be studied in laboratories with much more sophisticated instruments than anything that can be sent to Mars.

Because sample storage will take up room inside the rover, however, it wont be able to carry instruments for analyzing dugup samples on Mars as Curiosity does. "Curiosity has flown really high end instruments to do its measurements on the surface of Mars," says Jack Mustard of Brown University, who chaired the Science Definition Team. "What this coming rover will do is arguably a better job of finding materials that are interesting. Its somewhat upgraded in its capabilities to do remote measurements. It doesnt try to do any in situ analysis" like Curiosity Sample Analysis at Mars (SAM) and Chemistry and Mineralogy (CheMin) instruments do.

But that decision has angered some Mars scientists, who say the rover will have to sacrifice too much of its instrument space for caching samples. "I think if we are going to have a Curiosity duplicate rover in 2020, it should be loaded with instruments to do in situ science," says Robert Zubrin, cofounder and president of the Mars exploration advocacy nonprofit, The Mars Society. "This one says its going to have 28 kilograms of science instruments. Curiosity has 80 kilograms. They have reduced the science payload by a factor of three in order to have this caching function, which may not have any utility whatsoever." Zubrin says it leaves too much up to chance to have the return of these samples rely on an unspecified mission in the future making a precision rendezvous and landing at the same spot to collect them.

The Science Definition Team members say the 2020 rover will still be able to do significant science, and its important to initiate Mars sample return now. "This mission I think will be on par, in terms of what we learn, with Curiosity, and hold the future prospect of being able to learn 10 times more by bringing samples back to Earth. None of us are going looking for Klingons, but we would be thrilled if we could help find a sample that contains microbes," says Scott Murchie at Johns Hopkins University Applied Physics Laboratory, who was a member of the team.

Summary generated by summarizer based on TF-ISF function:

Cache and Not Carry: Next Mars Rover to Collect Samples for Return to a Earth Someday.

What this coming rover will do is arguably a better job of finding materials that are interesting.

Its somewhat upgraded in its capabilities to do remote measurements.

"I think if we are going to have a Curiosity duplicate rover in 2020, it should be loaded with instruments to do in situ science," says Robert Zubrin, cofounder and president of the Mars exploration advocacy nonprofit, The Mars Society. They have reduced the science payload by a factor of three in order to have this caching function, which may not have any utility whatsoever. Zubrin says it leaves too much up to chance to have the return of these samples rely on an unspecified mission in the future making a precision rendezvous and landing at the same spot to collect them.

Code for summarizer:

<!DOCTYPE html>
<html>
<head>

</head>
<body>

<?php
ini_set('memory_limit', '-1');
ini_set('max_execution_time', 300);
function microtime_float()
{
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
# Starting time
$time_start = microtime_float();
echo $time_start."<br><br>";
$doc = file_get_contents("basketball.txt");

echo "<b>Original Content:</b><br>$doc<br>";


# Format the document to remove characters other than periods and alphanumerics.
$formatted_doc = format_doc($doc);

# Remove Stop words
$stop_words = array('a','able','about','above','abst','accordance','according','based','accordingly','across','act',
'added','adj','affected','affecting','affects','after','afterwards','again','against','ah','all','almost','alone',
'along','already','also','although','always','am','among','amongst','an','and','announce','another','any','anybody',
'anyhow','anymore','anyone','anything','anyway','anyways','anywhere','apparently','approximately','are','aren','actually',
'arent','arise','around','as','aside','ask','asking','at','auth','available','away','awfully','b','back','be','became',
'because','become','becomes','becoming','been','before','beforehand','begin','beginning','beginnings','begins',
'behind','being','believe','below','beside','besides','between','beyond','biol','both','brief','briefly','but','by',
'c','ca','came','can','cannot','cant','cause','causes','certain','certainly','co','com','come','comes','contain',
'containing','contains','could','couldnt','d','date','did','didnt','different','do','does','doesnt','doing','done',
'dont','down','downwards','due','during','e','each','ed','edu','effect','eg','eight','eighty','either','else',
'elsewhere','end','ending','enough','especially','et','et-al','etc','even','ever','every','everybody','everyone',
'everything','everywhere','ex','except','f','far','few','ff','fifth','first','five','fix','followed','following',
'follows','for','former','formerly','forth','found','four','from','further','furthermore','g','gave','get','gets',
'getting','give','given','gives','giving','go','goes','gone','got','gotten','h','had','happens','hardly','has',
'hasnt','have','havent','having','he','hed','hence','her','here','hereafter','hereby','herein','heres','hereupon',
'hers','herself','hes','hi','hid','him','himself','his','hither','home','how','howbeit','however','hundred','i',
'id','ie','if','ill','im','immediate','immediately','importance','important','in','inc','indeed','index','information',
'instead','into','invention','inward','is','isnt','it','itd','itll','its','itself','ive','j','just','k','keep',
'keeps','kept','kg','km','know','known','knows','l','largely','last','lately','later','latter','latterly','least','less',
'lest','let','lets','like','liked','likely','line','little','ll','look','looking','looks','ltd','m','made'
,'mainly','make','makes','many','may','maybe','me','mean','means','meantime','meanwhile','merely','mg','might',
'million','miss','ml','more','moreover','most','mostly','mr','mrs','much','mug','must','my','myself','n','na','name',
'namely','nay','nd','near','nearly','necessarily','necessary','need','needs','neither','never','nevertheless',
'new','next','nine','ninety','no','nobody','non','none','nonetheless','noone','nor','normally','nos','not','noted',
'nothing','now','nowhere','o','obtain','obtained','obviously','of','off','often','oh','ok','okay','old',
'omitted','on','once','one','ones','only','onto','or','ord','other','others','otherwise','ought','our','ours','ourselves',
'out','outside','over','overall','owing','own','p','page','pages','part','particular','particularly','past',
'per','perhaps','placed','please','plus','poorly','possible','possibly','potentially','pp','predominantly','present',
'previously','primarily','probably','promptly','proud','provides','put','q','que','quickly','quite','qv','r',
'ran','rather','rd','re','readily','really','recent','recently','ref','refs','regarding','regardless','regards',
'related','relatively','research','respectively','resulted','resulting','results','right','run','s','said','same',
'saw','say','saying','says','sec','section','see','seeing','seem','seemed','seeming','seems','seen','self',
'selves','sent','seven','several','shall','she','shed','shell','shes','should','shouldnt','show','showed','shown',
'showns','shows','significant','significantly','similar','similarly','since','six','slightly','so','some',
'somebody','somehow','someone','somethan','something','sometime','sometimes','somewhat','somewhere','soon','sorry',
'specifically','specified','specify','specifying','still','stop','strongly','sub','substantially','successfully',
'such','sufficiently','suggest','sup','sure','t','take','taken','taking','tell','tends','th','than','thank','thanks',
'thanx','that','thatll','thats','thatve','the','their','theirs','them','themselves','then','thence','there',
'thereafter','thereby','thered','therefore','therein','therell','thereof','therere','theres','thereto','thereupon',
'thereve','these','they','theyd','theyll','theyre','theyve','think','this','those','thou','though','thoughh',
'thousand','throug','through','throughout','thru','thus','til','tip','to','together','too','took','toward','towards',
'tried','tries','truly','try','trying','ts','twice','two','u','un','under','unfortunately','unless','unlike',
'unlikely','until','unto','up','upon','ups','us','use','used','useful','usefully','usefulness','uses','using',
'usually','v','value','various','ve','very','via','viz','vol','vols','vs','w','want','wants','was','wasnt','way','we',
'wed','welcome','well','went','were','werent','weve','what','whatever','whatll','whats','when','whence','whenever',
'where','whereafter','whereas','whereby','wherein','wheres','whereupon','wherever','whether','which','while',
'whim','whither','who','whod','whoever','whole','wholl','whom','whomever','whos','whose','why','widely','willing',
'wish','with','within','without','wont','words','world','would','wouldnt','www','x','y','yes','yet','you','youd',
'youll','your','youre','yours','yourself','yourselves','youve','z','zero');
$doc_stop =preg_replace('/\b('.implode('|',$stop_words).')\b/','',strtolower($doc));

#####Time to remove stop words
$time1 = microtime_float();
echo "<br><br>Time to remove stop words: ",$time1-$time_start;

# Spliting into sentences
$sentences = convtosentences($doc);

#####Time to split into sentences
$time2 = microtime_float();
echo "<br><br>Time to split into sentences: ",$time2-$time1;

$n = count($sentences);

# Spliting into terms
$doc_st = format_sent($doc_stop);

$term=preg_split("/\s/",$doc_st,-1,PREG_SPLIT_NO_EMPTY);

#####Time to split into terms
$time3 = microtime_float();
echo "<br><br>Time to split into terms: ",$time3-$time2;


$terms = array_unique($term);
sort($terms);
$t = count($terms);
for($i=0;$i<$t;$i++)
	$terms[$i] = preg_replace('/[\-]+/','', $terms[$i]);

# Initialize term-frequecy array
$tf=array();
for($i=0;$i<$n;$i++)
{
	for($j=0;$j<$t;$j++)
	{
		$tf[$i][$j] = 0;
		
	}
}

# Initialize Nk array(Number of Documents the term occurs)
for($j=0;$j<$t;$j++)
{
	$nk[$j] = 0;
}

# Count TF for each word

for($i=0;$i<$n;$i++)
{
	for($j=0;$j<$t;$j++)
	{
		$nt = preg_match_all('/\b'.preg_quote($terms[$j]).'\b/', $sentences[$i],$match);
		$tf[$i][$j] = 1+ log($nt);
		if($nt!=0)
			$nk[$j]++;
		else
			$tf[$i][$j] = 0;
	}
}

#####Time to calculate TF
$time4 = microtime_float();
echo "<br><br>Time to calculate TF: ",$time4-$time3;

# Calculate weights of each term for every sentence
$w=array();
$idf = array();;
$idf_temp = 0;
for($i=0;$i<$n;$i++)
{
	for($k=0;$k<$t;$k++)
	{
		$idf_temp = @($n/$nk[$k]);
		if($nk[$k]==0)
			$idf_temp=0;
		$idf[$k] = @log($idf_temp);
		if(is_nan($idf[$k]))
			$idf[$k]=0;
		if(is_infinite($idf[$k]))
			$idf[$k]=0;
		$w[$i][$k] = $tf[$i][$k]*$idf[$k];
		if(is_nan($w[$i][$k]))
			$w[$i][$k]=0;
	}
}

#####Time to calculate 
$time5 = microtime_float();
echo "<br><br>Time to calculate IDF: ",$time5-$time4;


# Calculating similarity score

$sim=array();
for($i=0;$i<$n;$i++)
{
	for($j=$i;$j<$n;$j++)
	{
		$a=$b=$c=$d=$e = 0;
		for($k=0;$k<$t;$k++)
		{
			$a += ($w[$i][$k]*$w[$j][$k]);
			$b += ($w[$i][$k]*$w[$i][$k]);
			$c += ($w[$j][$k]*$w[$j][$k]);
				
		}
		$e=$b*$c;
		$d = sqrt($e);
		$sim[$i][$j] = @($a/$d);
		if($d==0)
			$sim[$i][$j]=0;
	}
}

#####Time to calculate 
$time6 = microtime_float();
echo "<br><br>Time to calculate similarity scores: ",$time6-$time5;

################# Coverage ###################
### Calculating mean vector O
$O = array();

for($k=0;$k<$t;$k++)
{
	$ok = 0;
	for($i=0;$i<$n;$i++)
	{
		$ok += $w[$i][$k];
	}
	$O[$k] = 1/$n*($ok);
}

#Initial pop
$initpop = array();
for($i=0;$i<20;$i++)
{
	for($j=0;$j<$n;$j++)
	{
		$initpop[$i][$j] = 0;

	}
}
$initpop[0][2] = 1;
$initpop[0][4] = 1;
$initpop[0][7] = 1;
$initpop[0][13] = 1;
$initpop[0][17] = 1;

$wi=0;
$pops = 0;
$count = 0;
## Initial population for summary
for($gen=0;$gen<10;$gen++)
{
$j=$wi+5;
while($wi<$j)
{

	$initpop[$wi+1] = $initpop[$pops];
	shuffle($initpop[$wi+1]);
	$population = array();
	$population = array_filter($initpop[$wi+1]);
	$summary1 = null;
	foreach($population as $key => $value)
	{
		$summary1 .= "$sentences[$key]".". ";
	}
	if(strlen($summary1)<3000 && strlen($summary1)>800)
		$wi++;
}

$sol = array();
$wj = $wi-5;
$k = $wi;
while($wj<$k)
{
	
	$sol[$gen][$wj] = hill_climb($initpop[$wj]);
	$wj++;
}

echo "<br><br><b>SOF for Generation $gen is</b><br> ";
print_r($sol[$gen]);
$max[$gen] = max($sol[$gen]);
$maxs = array_keys($sol[$gen], max($sol[$gen]));
$pops = $maxs[0];
$goodpop[$gen] = $pops;
echo "<br><br><b>Maximum value is $max[$gen] with population :<br></b>";
print_r($initpop[$pops]);
$pop = $initpop[$pops];

# hill climbing

if($gen>0 && $max[$gen]<max($max))
	$count++;
if($count == 3)
	break;
echo "<br><br>Population with Maximum SOF at this generation: <br>";
print_r($pops);

} 

$maxgen = max($max);
$maxgenkey = array_keys($max, $maxgen);
$bestgen = $maxgenkey[0];
echo "<br><br>Maximum SOF is at generation $bestgen";
$popsummary = $goodpop[$bestgen];

#####Time to calculate SOF
$time7 = microtime_float();
echo "<br><br>Time to calculate SOF: ",$time7-$time6;

echo "<br><br>Summary based on best population:</b><br>";
$population = array();
$population = array_filter($initpop[$popsummary]);

$summary = null;
foreach($population as $key => $value)
{
	$summary .= "$sentences[$key]".". ";
}

echo "$summary";

# Finishing time
$time_stop = microtime_float();
echo "<br><br>".$time_stop;
echo "<br>Duration:". ($time_stop-$time_start);

## Function for hill climbing algorithm
function hill_climb($pop)
{
	$smod = array_count_values_of(1, $pop);
	global $t,$w,$O,$n,$sim;
	# Calculate mean vector Os
	$Os = array();
	
	$npop = array_filter($pop);
	for($k=0;$k<$t;$k++)
	{
		$osk = 0;
		foreach($npop as $key=> $value)
		{
			$osk += $w[$key][$k];
		}
		$Os[$k] = 1/$smod*($osk);
	}

	## Similarity score of O and Os
	$simoos = calc_similarity($O,$Os);

	# Similarity score between O and Si
	$simosi=0;

	for($i=0;$i<$n;$i++)
	{
		$a=$b=$c=$d=$e = 0;
		$simt=array();
		for($k=0;$k<$t;$k++)
		{
			$a += ($O[$k]*$w[$i][$k]);
			$b += ($O[$k]*$O[$k]);
			$c += ($w[$i][$k]*$w[$i][$k]);
				
		}
		$e=$b*$c;
		$d = sqrt($e);
		$simt = @($a/$d);
		if($d==0)
			$simt=0;
		$simosi += $simt;
	}

	# calculate coverage
	$fcover = $simoos + $simosi;

	######### Calculate Diversity(fdiver)

	$fdiver = 0;
	for($i=0;$i<$n;$i++)
	{
		$ftemp = 0;
		for($j=$i;$j<$n;$j++)
		{
			$ftemp += (1-$sim[$i][$j])*$pop[$i]*$pop[$j];
		}
		$fdiver +=$ftemp;
	}

	####### Calculate single objective function ########

	$alpha = 1;
	$f = ($alpha*(($fcover+$fdiver)/2))+((1-$alpha)*(sqrt($fcover*$fdiver)));
	return $f;
}


function convtosentences($content)
{
	$content = preg_split("/\.\s|[\n\r]+/",$content,-1,PREG_SPLIT_NO_EMPTY);
	return $content;
}	
	
function format_sent($sen)
{
	$sen = trim(preg_replace('/[^a-z0-9\s\-]+/', '', strtolower($sen)));
	return $sen;
}

 ###Calculate similarity scores
function calc_similarity($s1,$s2)
{
	$sim=array();
	global $t;
	$a=$b=$c=$d=$e = 0;
	for($k=0;$k<$t;$k++)
	{
		$a += ($s1[$k]*$s2[$k]);
		$b += ($s1[$k]*$s1[$k]);
		$c += ($s2[$k]*$s2[$k]);
	}
	$e=$b*$c;
	$d = sqrt($e);
	$sim = @($a/$d);
	if($d==0)
		$sim=0;
	return $sim;
}

function format_doc($content)
{
	$content = preg_replace('/[\n\r\-]+/', ' ', strtolower($content));
	$content = trim(preg_replace('/[^a-z0-9\s\.]+/', '', $content));
	return $content;
}

function array_count_values_of($value, $array) {
    $counts = array_count_values($array);
    return $counts[$value];
}


?>
</body>
</html>