PHP: Generating a Summary Abstract from a Text or HTML String, Limited by Words or Sentences

by Yang Yang on February 28, 2009

On index or transitional pages, such as the homepage or category pages of a WordPress site, you usually don’t want to show the full text of your deep content pages. Instead, you show a snippet of the first few sentences or words as a summary, with a "read more" link to the actual article.

This is generally good for SEO, as it reduces duplicate content on your site and increases page views. With WordPress you can achieve this with a plugin named Evermore. With a homemade CMS, however, you will have to write a little code of your own to select and display content abstracts.

While you may be better off doing this in plain SQL, which I’m not an expert in, I’ll share a little PHP trick here that accomplishes the same task.

Full HTML Text
$text = <<<TEXT
I wrote a <a href="#">blog post</a> yesterday about Chinese web design fonts. What did you think? It appeared that many are very interested. I guess it's the language barriers and cultural differences that make the westerners eager to know more about us. All right then, let me write more about that and maybe start a <strong>brand new domain</strong> for it. Stay tuned!
TEXT;
The Problem – select first sentences

Select and display the first 3 sentences (max) of the full HTML text above.

The Solution
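<?php
// Strip the tags, then match up to 3 sentences from the start of the text.
// A "sentence" here is a run of non-terminators followed by one or more of . ! ?
preg_match('/^([^.!?]*[\.!?]+){0,3}/', strip_tags($text), $abstract);
// $abstract[0] holds the whole match, which is the abstract.
echo $abstract[0];
?>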

Output:

I wrote a blog post yesterday about Chinese web design fonts. What did you think? It appeared that many are very interested.

HTML tags are stripped before the summary is extracted because slicing the raw markup could cut an element in half, leaving a partial tag or an opening tag without its closing tag, and so produce invalid HTML. You can preserve tags in the abstract, but that takes a more sophisticated algorithm.
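If you need this in more than one place, the one-liner can be wrapped in a small function. The following is only a sketch: the name `abstract_by_sentences` and the fallback to the whole text when no sentence terminator is found are my own assumptions, not part of the original trick.

```php
<?php
// Hypothetical helper (name and fallback behaviour are assumptions).
// Returns at most $maxSentences leading sentences of $html as plain text.
function abstract_by_sentences(string $html, int $maxSentences = 3): string
{
    // Remove tags first so the regex never cuts an element in half.
    $plain = trim(strip_tags($html));

    // Same pattern as above, with the sentence limit made configurable.
    $pattern = '/^([^.!?]*[.!?]+){0,' . max(1, $maxSentences) . '}/';

    if (preg_match($pattern, $plain, $match) && $match[0] !== '') {
        return $match[0];
    }

    // No ".", "!" or "?" in the text: the match is empty,
    // so fall back to the whole plain text.
    return $plain;
}

echo abstract_by_sentences('I wrote a <a href="#">blog post</a> yesterday. What did you think? It appeared fine. More text here.');
```

Without the fallback, a text that contains no sentence terminator at all would produce an empty abstract, because the pattern is allowed to match zero sentences.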

Another Problem – select first words

You want to distill an abstract of the first 30 words, rather than of sentences ended by terminal punctuation such as ‘.’, ‘!’ and ‘?’.

The Solution

Simply modify the regular expression to:

/^([^.!?\s]*[\.!?\s]+){0,30}/

Output:

I wrote a blog post yesterday about Chinese web design fonts. What did you think? It appeared that many are very interested. I guess it's the language barriers and cultural

The abstract now ends with an incomplete sentence, so you may want to append a trailing ‘…’ to signal that it has been cut short.
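Here is a rough sketch of the word-limited version with the trailing ellipsis. The function name `abstract_by_words` is hypothetical, and two details are my own additions rather than part of the trick above: a space is appended to the text before matching, so that a final word with nothing after it is not dropped, and the ellipsis is only added when the text was actually cut in the middle of a sentence.

```php
<?php
// Hypothetical helper (name, padding and ellipsis rule are assumptions).
// Returns the first $maxWords words of $html as plain text.
function abstract_by_words(string $html, int $maxWords = 30): string
{
    $plain = trim(strip_tags($html));

    // Same pattern as above, with the word limit made configurable.
    $pattern = '/^([^.!?\s]*[.!?\s]+){0,' . max(1, $maxWords) . '}/';

    // The padding space lets the pattern count a last word
    // that has no whitespace or punctuation after it.
    preg_match($pattern, $plain . ' ', $match);
    $abstract = rtrim($match[0] ?? '');

    // $abstract is a prefix of $plain, so a shorter length means it was cut.
    $wasCut = strlen($abstract) < strlen($plain);
    $endsSentence = (bool) preg_match('/[.!?]$/', $abstract);

    if ($wasCut && !$endsSentence) {
        $abstract .= "\u{2026}"; // the "…" character
    }

    return $abstract;
}

echo abstract_by_words('One two three four five', 3);
```

Note that the pattern treats punctuation as a word boundary too, so tokens such as "3.14" or "e.g." count as more than one word.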

In regular expressions, \s matches whitespace characters, including the single-byte space, tab and newline.
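As a quick illustration of what \s covers, this small sketch (the sample string is made up) splits a string on runs of whitespace and shows that the space, the tab and the newline all act as word separators:

```php
<?php
// Made-up sample containing a space, a tab and a newline between words.
$sample = "one two\tthree\nfour";

// Split on one or more whitespace characters, dropping empty pieces.
$words = preg_split('/\s+/', $sample, -1, PREG_SPLIT_NO_EMPTY);

echo count($words); // 4
```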

Scott March 25, 2009 at 1:48 pm

informative indeed!

Yang Yang February 26, 2010 at 4:51 pm

actually, i doubt that

Pradeep August 27, 2010 at 1:48 am

What if the first line contains only dots like this ………………………………………………………
