Reuters NewsML Syndication Integration

The Reuters manual explains how to write Java code that notifies your web application of new Reuters stories and extracts information from the NewsML files for your web site. This blog post illustrates another way to integrate: scanning the syndication folders for NewsML files. This alternative method can be implemented in various programming languages such as Python, Perl, PHP or Ruby.

Introduction

Reuters Online Syndication offers a suite of ready-to-publish news stories that companies can subscribe to. These news stories are categorized by geographical region and by interest, such as finance and sports. The syndication is not the same as the RSS feeds that you see on reuters.com; Reuters Online Syndication pushes the latest breaking news, associated graphics and award-winning pictures to your site.

The Syndication Software

Your Reuters representative will give you a URL or a CD containing the syndication software and your license key. The syndication software should come as a zip or jar file.

1. Extract the zip/jar file. Then create a license.key file in the same folder as siclone.config. The first line of license.key must contain the license key.

2. Now create a shell script in the same folder as license.key:

#!/bin/sh
PATH=.:$PATH
finish-upgrade.sh
launch-program.sh

With this shell script, the syndication software can be run from the command line or as a service, without the need for a graphical user interface. You can adapt the shell script into a batch file for MS-DOS/Windows.

3. Execute the shell script. Once the syndication software is running, you should see several folders being created. Examples of these new folders are OLGBWORLD_iptc and OLUSPOLITIC_iptc. You will also see some XML and JPG files in these new folders. The XML files are the NewsML files that contain the Reuters stories, while the JPG files are the pictures and photos for some of the stories.

The folders or news channels are controlled by Reuters, so if you need more news channels, you need to call your Reuters representative. Each folder corresponds to one news channel. When the syndication software is running, it will write its status into siclone.config.

NewsML Files

Reuters has a good showcase for the NewsML format. The showcase contains some example NewsML packages which you can download to get a feel for the key attributes in the NewsML files. You can also try out the NewsML toolkit from SourceForge.

At this point, you can start coding your own application in your favourite programming language to scan the news folders at regular intervals for NewsML files. Once the syndication software has finished writing and has closed a NewsML file, the file can be deleted from its folder without affecting the syndication software. Occasionally, you may come across files with filenames ending in .tmp. These are temporary files that the syndication software is still using to dump the NewsML contents. Your application should ignore these temporary files.
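The scanning step described above could be sketched in Python as follows. This is only a minimal sketch under some assumptions: every sub-folder of the install folder is treated as a news channel, completed NewsML files are assumed to end in .xml, and the names find_ready_newsml, poll and handle are made up for this illustration, not part of the syndication software.

```python
import time
from pathlib import Path


def find_ready_newsml(root):
    """Return the .xml files found in the channel folders directly under root.

    Assumption: each sub-folder of root (e.g. OLGBWORLD_iptc) is a news channel.
    """
    ready = []
    for channel_dir in sorted(Path(root).iterdir()):
        if not channel_dir.is_dir():
            continue  # skip plain files such as siclone.config or license.key
        for path in sorted(channel_dir.iterdir()):
            # Only .xml files are kept, so .tmp files (still being written by
            # the syndication software) and .jpg pictures are ignored here.
            if path.is_file() and path.suffix.lower() == ".xml":
                ready.append(path)
    return ready


def poll(root, handle, interval_seconds=60, rounds=None):
    """Scan root at regular intervals and pass each NewsML file to handle.

    Each file is deleted after handling so it is not processed twice.
    rounds=None means scan forever; a number limits the scans (handy in tests).
    """
    done = 0
    while rounds is None or done < rounds:
        for path in find_ready_newsml(root):
            handle(path)
            path.unlink()
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_seconds)
```

For example, poll("/path/to/syndication", print) would print the path of each new NewsML file once a minute. Deleting after handling is one design choice; moving the files to an archive folder would work just as well.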

A NewsML file contains numerous attributes, which can be quite daunting initially. For starters, you can use an XML parser library available in your programming language to parse out these attributes first:

  • DateTime
  • Role
  • USN
  • Slug
  • Headline
  • body

If you need to understand the classification of the story, you may also want to parse out the IPTCSubjectCodes and N2000 attributes. For more information on these attributes, call your Reuters representative :-)
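As a rough illustration of parsing out the starter attributes listed above, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It assumes, purely for illustration, that each attribute appears as an XML element with that name. Real NewsML files may carry some of them as XML attributes or property values instead, so check the lookup against your actual files. The names FIELDS, local_name and extract_fields are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# The starter attributes listed in this post.
FIELDS = ("DateTime", "Role", "USN", "Slug", "Headline", "body")


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


def extract_fields(xml_text, fields=FIELDS):
    """Return {field: text} for the first element matching each field name.

    Assumption: fields are stored as elements; matching is case-insensitive
    and ignores XML namespaces. Fields that are not found are left out.
    """
    root = ET.fromstring(xml_text)
    wanted = {name.lower(): name for name in fields}
    found = {}
    for elem in root.iter():
        key = wanted.get(local_name(elem.tag).lower())
        if key is not None and key not in found:
            # itertext() joins the text of the element and all its children.
            found[key] = "".join(elem.itertext()).strip()
    return found
```

Note that this returns the plain text of each element. To keep the markup of the story body for display, you could serialize the matching element with ET.tostring(elem, encoding="unicode") instead.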

All the contents (nodes) within the ... tag-pair make up the story that you can display to your audience. It is a well-formed XHTML document, so it can be included directly in any web page. In some stories, you will also find RICs (Reuters Instrument Codes) embedded within the body. You can remove these RICs using regular expressions, or convert them to valid URLs that link to pages displaying the financial information for these codes.
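The RIC handling could look something like the Python sketch below. It rests on two assumptions that you should verify against your own feed: that RICs show up in the body as escaped angle-bracket tokens of the form &lt;SYMBOL.EXCHANGE&gt; (for example &lt;IBM.N&gt;), and that you have a quote page to link to. The URL https://example.com/quote is a placeholder, and RIC_PATTERN, link_rics and strip_rics are names invented for this example.

```python
import re
from urllib.parse import quote

# Assumed RIC form: escaped angle brackets around SYMBOL.EXCHANGE, e.g. &lt;IBM.N&gt;
RIC_PATTERN = re.compile(r"&lt;([A-Z0-9]{1,10}\.[A-Z]{1,3})&gt;")

# Placeholder quote page; replace with the page that shows financial data.
QUOTE_URL = "https://example.com/quote?ric={ric}"


def link_rics(body_html):
    """Replace each RIC token with a link to its (hypothetical) quote page."""
    def to_link(match):
        ric = match.group(1)
        url = QUOTE_URL.format(ric=quote(ric))
        return '<a href="{0}">{1}</a>'.format(url, ric)
    return RIC_PATTERN.sub(to_link, body_html)


def strip_rics(body_html):
    """Remove RIC tokens, together with any whitespace just before them."""
    return re.sub(r"\s*" + RIC_PATTERN.pattern, "", body_html)
```

The pattern is deliberately narrow. If your stories contain RICs in other shapes, such as index or currency codes, the regular expression will need to be widened.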

This blog post has illustrated an alternative method of integrating with the Reuters syndication software which allows you to write your own NewsML processing routines in a non-Java language. It has also highlighted some NewsML attributes which you can look at first to start setting up your news site.