11.30.06

BBC feeds reparser (continued)

A brief update on the script I blogged about the other day, which reparses the partial feeds from the BBC News website and creates full feeds from them.

I’ve gone through the script tonight, adding scrape caching (currently a 6 hour cache) and a few other tweaks. This has cut the build time by around three quarters, although of course this varies depending on whether articles have to be fetched.
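
As a rough illustration of the idea (not the actual script), a time-based scrape cache might look something like this in Python. The names `ScrapeCache` and `get_or_fetch` are made up for the example, and it keeps everything in memory, whereas a real script would more likely cache to disk or a database.

```python
import time

CACHE_TTL_SECONDS = 6 * 60 * 60  # six hours, matching the cache period mentioned above


class ScrapeCache:
    """Hypothetical in-memory cache mapping an article URL to its scraped content."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self.ttl = ttl
        self.clock = clock   # injectable so expiry can be tested without waiting
        self._entries = {}   # url -> (stored_at, content)

    def get_or_fetch(self, url, fetch):
        """Return cached content for url, calling fetch(url) only if it is missing or stale."""
        now = self.clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        content = fetch(url)
        self._entries[url] = (now, content)
        return content
```

Only articles that are new or older than the cache period trigger a fetch, which is why the build time depends on how many articles have to be fetched.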

Now that this is done, I can add a little more filtering to the scraper to remove the occasional unwanted elements which crop up from time to time, such as voting forms, which I didn’t originally think about. Another thing I want to deal with can be seen if you look at the source of any BBC News article: there are very few closing tags for paragraphs, so I’d like to handle this within my parser at least.
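
To give a flavour of that kind of filtering, here is a minimal sketch using Python’s standard `html.parser`. It is an assumption about how one could do it, not the code in the script; `ArticleCleaner` and `clean_article` are hypothetical names. It drops everything inside a `<form>` and closes any paragraph that is still open when the next one starts.

```python
from html import escape
from html.parser import HTMLParser


class ArticleCleaner(HTMLParser):
    """Re-emit HTML, dropping <form> blocks and closing unclosed <p> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.form_depth = 0   # > 0 while inside a <form> that is being discarded
        self.p_open = False   # True while a <p> has been opened but not yet closed

    @staticmethod
    def _attrs(attrs):
        return "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True)) for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
            return
        if self.form_depth:
            return
        if tag == "p":
            if self.p_open:
                self.out.append("</p>")  # supply the missing closing tag
            self.p_open = True
        self.out.append("<%s%s>" % (tag, self._attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        if not self.form_depth:
            self.out.append("<%s%s />" % (tag, self._attrs(attrs)))

    def handle_endtag(self, tag):
        if tag == "form":
            self.form_depth = max(0, self.form_depth - 1)
            return
        if self.form_depth:
            return
        if tag == "p":
            if not self.p_open:
                return  # ignore a stray </p>
            self.p_open = False
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.form_depth:
            self.out.append(escape(data, quote=False))


def clean_article(html_text):
    cleaner = ArticleCleaner()
    cleaner.feed(html_text)
    cleaner.close()
    if cleaner.p_open:
        cleaner.out.append("</p>")
    return "".join(cleaner.out)
```

This only handles the simple case of one paragraph running into the next; a paragraph left open inside some other block element would need more care.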

It’s been quite interesting since I released this. I added a log to the script so I can see in more detail what is being fetched and when (and there are quite a large number of people using it). I didn’t realise, for instance, that the useragent string sent with a Bloglines request gives the number of subscribers it is catering for, which is quite useful. Google Reader, however, doesn’t appear to do this, which is a pity but of no real significance at this stage.
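
For anyone wanting to pull that number out of their own logs, something along these lines should do it. This is a sketch in Python which assumes the useragent contains a number followed by the word "subscribers"; the exact format isn’t documented here, so treat the pattern and the function name `subscriber_count` as assumptions.

```python
import re

# Assumed pattern: a number followed by "subscriber" or "subscribers" in the useragent.
SUBSCRIBERS_RE = re.compile(r"(\d+)\s+subscribers?", re.IGNORECASE)


def subscriber_count(user_agent):
    """Return the subscriber count advertised in a useragent string, or None if absent."""
    match = SUBSCRIBERS_RE.search(user_agent or "")
    return int(match.group(1)) if match else None
```

Readers that don’t report a count simply give `None`, so they can still be logged as a single request.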


11.27.06

BBC RSS reparser

What?

Basically I wanted full article text and images from the BBC RSS feeds, so I’ve built what is at the moment a fairly simple reparser to scrape the rest of the content and include it in the feed.

I’ve built it as part of another project, but also so that when I’m getting the train into work I can read the full articles from the BBC RSS feeds, not just the first line or so, without forking out for a mobile data plan. In the morning the feed client on my mobile device updates over my home wifi before I leave. I am planning to test on other devices but haven’t had an opportunity thus far.

How?

It’s not that complicated a script: essentially it reads the requested RSS feed, scrapes the target link for each item in the channel and pumps the feed back out with the full text. You can also choose not to include images if you have a device with limited storage. The generated XML on its own is around 110Kb, depending on the feed, and of course the images will increase the total download size quite considerably if you wish to do as I do and cache it all to your mobile device.
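
The overall flow can be sketched in a few lines of Python. This is an illustration rather than the script itself: `build_full_feed`, the `scrape_article` callback and the `include_images` flag are hypothetical names, and stripping images with a regular expression is just the simplest thing that works for a sketch.

```python
import re
import xml.etree.ElementTree as ET

IMG_TAG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def build_full_feed(rss_xml, scrape_article, include_images=True):
    """Return the feed with each item's description replaced by the scraped article.

    scrape_article(url) is supplied by the caller and returns the article HTML.
    """
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue  # nothing to scrape for this item
        body = scrape_article(link.strip())
        if not include_images:
            body = IMG_TAG_RE.sub("", body)
        description = item.find("description")
        if description is None:
            description = ET.SubElement(item, "description")
        description.text = body  # ElementTree escapes the markup when serialising
    return ET.tostring(root, encoding="unicode")
```

Passing the scraper in as a function keeps the feed handling separate from the page scraping, so either part can be changed on its own.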

At the moment it caches the original RSS feeds for an hour but doesn’t cache the scraped content, which is something I still need to work on. An optional item limit might be useful as well; it’s easy to implement but still needs a spare moment or so, which I need to find!

Using it…

The link below will allow you to build your own feed based on a BBC News RSS feed. I’ve tested the available feeds and believe that so far they are producing satisfactory, valid output.

Access Full Feeds info

I’ve tested so far in the Egress and Bloglines readers. It is still a little messy in its implementation, so please remember it’s still a work in progress!

Update! 29th Nov
  • The script is being hit a great deal more than I expected, indicating that a) I need to optimise it a little more for efficiency/speed and b) there is a demand for full feeds (no surprises there!)
  • I’ll be updating the script over the next 24 hours to include caching of the article texts. This will a) increase speed, a lot, and b) enable me to do a more comprehensive filter of the tags and article contents to remove forms and clean up the rather dirty markup which ends up in the RSS > Item > Description part of the feed. I’ll have to think a bit more about how long to cache this for, but it should be done by early Thursday morning.
