Recommend relevant content
Today we are pleased to announce the release of our Article Extractor API! When recommending content, it's important to base recommendations only on the relevant text of an article. We have often faced this challenge with online articles and blogs: we want to fetch a URL but extract just the main body of text, with no headers, footers, sidebars or anything like that. We developed an internal API for this, which we used for things like NewsBot and the news recommender. A lot of people have been interested in using it, so, having polished the code a bit, we're now releasing the API to the public!
Technical details
The API takes a URL of a blog or article and then extracts the title, author, date published, image, videos, keywords, meta description and the article body. It's worth noting that whilst the article extraction works most of the time, there are some cases when it doesn't work. For example, sometimes paywalls can trip up the system and sometimes it simply doesn't find the right body of text.
When the API encounters a new URL, the contents of the remote page have to be fetched. This can be time consuming because remote servers can be slow, so we cache the responses for previously seen URLs for 1 day. If a cached copy has expired, the remote page is fetched again. Because of this overhead, requests to the API can take several seconds. To combat this, we've enabled a rate limit of 1 request per second. There are several instances of the API running behind a load balancer to ensure reliability.
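To make the caching and rate limit described above concrete, here is a minimal Python sketch. It is not the code running behind the API: the class names `TTLCache` and `Throttle`, and the injected `fetch`, `clock` and `sleep` arguments, are hypothetical and exist only for illustration. The first class keeps a fetched page for one day before fetching it again; the second is a client-side helper that spaces calls at least one second apart so a caller stays within the limit.

```python
import time
from typing import Callable, Dict, Optional, Tuple

ONE_DAY_SECONDS = 24 * 60 * 60


class TTLCache:
    """Illustrative cache: keeps a fetched value per URL for a fixed time."""

    def __init__(self, fetch: Callable[[str], str],
                 ttl: float = ONE_DAY_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch      # hypothetical function that downloads a page
        self._ttl = ttl
        self._clock = clock      # injectable so the expiry logic can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._store.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # cached copy is still fresh
        value = self._fetch(url)  # new URL or expired copy: fetch again
        self._store[url] = (now, value)
        return value


class Throttle:
    """Illustrative client-side helper: at most one call per interval."""

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would invoke `Throttle().wait()` before each request to the API, so that a burst of URLs is spread out to roughly one request per second.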
The main 'heavy lifting' of extracting useful information and detecting the main body of text is done by two Python packages, python-goose and Newspaper. Massive thanks to the creators of these because they work really well! Our aim in releasing the API is to let anyone quickly make use of our infrastructure, caching and reliability for a reasonable price. If you are interested in creating your own version, we would highly recommend using the above-mentioned packages!
Specification
The API has one endpoint, to which you send a GET request with the url parameter. Please have a look at the API specification for more detail. The response is a JSON object, something like this:
<pre>
{
"title": "Press me! The buttons that lie to you",
"author": "Chris Baraniuk",
"published": "2015-04-15T00:00:00.000Z",
"url": "http://www.bbc.com/future/story/20150415-the-buttons-that-do-nothing",
"image": "http://ichef.bbci.co.uk/wwfeatures/624_351/images/live/p0/2p/7f/p02p7fts.jpg",
"videos": [],
"keywords": ["control", "lie", "button", "system", "effect", "buttons", [...]],
"description": "Does it help to push the buttons on pedestrian crossings, [...]",
"body": "The tube pulls in to a busy station along the London Underground’s [...]"
}
</pre>
Note: the keywords, description and body fields were truncated for display purposes.
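As a rough sketch of how a client might work with this, the Python below builds the GET request URL and turns a response of the shape shown above into a typed object. The endpoint address `https://api.example.com/article` is a placeholder, not the real one (see the API specification for that), and the `Article` dataclass, the function names and the assumption that `videos` is a list of strings are ours for illustration only.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional
from urllib.parse import urlencode

# Placeholder endpoint: the real address is given in the API specification.
API_ENDPOINT = "https://api.example.com/article"


def build_request_url(article_url: str, endpoint: str = API_ENDPOINT) -> str:
    """Return the GET URL, with the article address percent-encoded."""
    return endpoint + "?" + urlencode({"url": article_url})


@dataclass
class Article:
    title: str
    author: Optional[str]
    published: Optional[datetime]
    url: str
    image: Optional[str]
    videos: List[str]      # assumed to be a list of video URLs
    keywords: List[str]
    description: str
    body: str


def parse_published(value: Optional[str]) -> Optional[datetime]:
    """Parse timestamps in the format of the sample, e.g. 2015-04-15T00:00:00.000Z."""
    if not value:
        return None
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_article(payload: str) -> Article:
    """Convert the JSON text of a response into an Article."""
    data = json.loads(payload)
    return Article(
        title=data.get("title", ""),
        author=data.get("author"),
        published=parse_published(data.get("published")),
        url=data.get("url", ""),
        image=data.get("image"),
        videos=list(data.get("videos", [])),
        keywords=list(data.get("keywords", [])),
        description=data.get("description", ""),
        body=data.get("body", ""),
    )
```

Missing fields fall back to empty values rather than raising, since extraction does not always find every field.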
Use cases
- Monitoring incoming news streams for analysis
- News aggregators
- Read-it-later features
- Crawling non-uniform data sources
- Quickly parsing your own site's content
If you're interested, the API is online and ready to be tried out! Please send us any feedback, and let us know if you come across any issues.