
Adding a Newsfeed to N2 CMS – Part 2: The XML Feed

Feed me!

A “News feed” is a cool-sounding term that basically means a text file (in a specific format of course). “Text file” does not sound anywhere near as cool.

Where does my feed come from?

That depends… If you have licensed a commercial news feed, like Adfero, you will have been given a URL pointing to your own personal download.

There are various free feeds available too, in varying formats, but I am focusing on the Adfero format for this example.

Adfero format

The Adfero file is a hierarchical XML file. It simply contains multiple Articles, each with multiple Categories and multiple Photos.

To process this structure my simplest solution was to create equivalent classes for each major element in the feed file and parse the file hierarchically:

  • FeedArticle
  • FeedCategory
  • FeedPicture*
    • FeedImage

*Note: a FeedPicture represents the picture, a FeedImage represents that picture in a specific size. This way you can process large and thumbnail image details from the feed file the same way.

Each class has enough members to store the incoming XML data and some simple helper getters to return useful items (like caption of 1st image, if any images are present). The classes and members are:

FeedArticle

  • NewsArticleId (int) – News article id (primary key in database)
  • ArticleNumber (string) – Unique article number from feed
  • Heading (string) – Article heading
  • Created (DateTime) – Date/Time article was created
  • Date (DateTime) – Publication date of article
  • Contents (string) – Contents/body of article
  • Summary (string) – Summary of article
  • Caption (string) – Shortcut: get caption of 1st image
  • PictureList (List<FeedPicture>) – List of 1 or more pictures, each with a thumbnail and a large image
  • CategoryList (List<FeedCategory>) – List of categories for this article
  • LargeImageUrl (string) – Shortcut: get URL of 1st large image
  • LargeImageWidth (int) – Shortcut: get width of 1st large image
  • LargeImageHeight (int) – Shortcut: get height of 1st large image
  • SmallImageUrl (string) – Shortcut: get URL of 1st thumbnail image
  • SmallImageWidth (int) – Shortcut: get width of 1st thumbnail image
  • SmallImageHeight (int) – Shortcut: get height of 1st thumbnail image

FeedCategory

  • Found (bool) – For internal processing use – was this category found in the database (and for the current article)?
  • CategoryId (int) – Unique category id – database PK
  • CategoryNumber (string) – Unique category number (from feed)
  • CategoryName (string) – Display name of category

FeedPicture

  • Portrait (bool) – Is the image taller than it is wide?
  • AspectRatio (float) – Aspect ratio of width to height of large image
  • PhotoId (int) – Original photo id (from feed)
  • LargeImage (FeedImage) – The large image details (if any)
  • SmallImage (FeedImage) – The small image details (if any)
  • Caption (string) – A text caption for the images

FeedImage

  • ImageURL (string) – Image URL
  • ImageSize (Size) – Size of image
  • Width (int) – Width of image – helper
  • Height (int) – Height of image – helper
  • AspectRatio (float) – Ratio of width to height
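
To make the shape of these classes concrete, here is a minimal sketch in Python (not the original .NET code). It keeps only a handful of the members listed above, and the snake_case names and default values are assumptions made for illustration. The point is the pattern: plain data holders plus "shortcut" getters that cope with missing images.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FeedImage:
    image_url: str = ""
    width: int = 0
    height: int = 0

    @property
    def aspect_ratio(self) -> float:
        # Ratio of width to height; 0.0 when the height is unknown.
        return self.width / self.height if self.height else 0.0


@dataclass
class FeedPicture:
    photo_id: int = 0
    caption: str = ""
    large_image: Optional[FeedImage] = None
    small_image: Optional[FeedImage] = None

    @property
    def portrait(self) -> bool:
        # True when the large image is taller than it is wide.
        img = self.large_image
        return bool(img and img.height > img.width)


@dataclass
class FeedCategory:
    category_number: str = ""
    category_name: str = ""
    category_id: int = 0
    found: bool = False


@dataclass
class FeedArticle:
    article_number: str = ""
    heading: str = ""
    picture_list: List[FeedPicture] = field(default_factory=list)
    category_list: List[FeedCategory] = field(default_factory=list)

    @property
    def caption(self) -> str:
        # Shortcut: caption of the 1st picture, or "" if there are none.
        return self.picture_list[0].caption if self.picture_list else ""

    @property
    def large_image_url(self) -> str:
        # Shortcut: URL of the 1st large image, or "" if none is present.
        if self.picture_list and self.picture_list[0].large_image:
            return self.picture_list[0].large_image.image_url
        return ""
```

Because the shortcuts return empty values rather than failing, display code can ask for article.caption without first checking whether any pictures arrived in the feed.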

Downloading the News feed

The .NET WebClient class is wonderful for downloading content asynchronously. Give it the URL and a local target filename and it does the download for you. If you attach an event handler to the DownloadFileCompleted event, you are informed when the file has been downloaded. You then have an XML file on the server, ready to process like any other local file.
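
The same download-then-notify pattern can be sketched outside .NET. The Python below is an illustration only, not the WebClient code: download_feed and download_feed_async are hypothetical helper names, and a worker thread plus a callback stands in for the DownloadFileCompleted event.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def download_feed(url: str, target: Path) -> Path:
    # Fetch the feed and write it to a local file.
    with urllib.request.urlopen(url) as response:
        target.write_bytes(response.read())
    return target


def download_feed_async(url, target, on_completed):
    # Run the download on a worker thread, then call on_completed(path).
    # If the download fails, f.result() raises inside the callback and the
    # error is only logged, so real code would want explicit error handling.
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(download_feed, url, target)
    future.add_done_callback(lambda f: on_completed(f.result()))
    executor.shutdown(wait=False)  # the queued download still runs to completion
    return future
```

Either way, the parsing code that follows only ever sees a local file, which also makes it easy to re-run against archived feed files.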

Eating the feed

The basic principle of parsing this XML feed follows the hierarchical structure of the feed.

  • Create an XmlTextReader
    • While there are more article nodes
      • Process the article and add it to a list of all articles
    • Store the articles from the list
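
As a rough illustration of that outer loop, here is a Python sketch that uses the standard library's streaming parser in place of XmlTextReader. The element name "Article" is an assumption about the feed, and read_articles is a hypothetical helper; the per-article work is passed in as a function.

```python
import io
import xml.etree.ElementTree as ET


def read_articles(source, process_article):
    # Stream the feed and hand each complete article element to process_article.
    articles = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Article":  # assumed element name
            articles.append(process_article(elem))
            elem.clear()  # drop child nodes we no longer need
    return articles


# A tiny made-up feed, just to exercise the loop.
sample = io.BytesIO(
    b"<Articles><Article><Heading>One</Heading></Article>"
    b"<Article><Heading>Two</Heading></Article></Articles>"
)
headings = read_articles(sample, lambda a: a.findtext("Heading"))
```

Streaming keeps memory use flat however large the day's feed is, since each article is processed and released before the next one is read.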

The article processing consists of checking child nodes and either storing attributes (like Heading, Contents etc), or processing a group of child pictures or child categories. The resulting FeedArticle object is returned:

  • Create a blank FeedArticle
    • While we are processing nodes of the current article
      • if node name is “Heading”, store the value as the article heading
      • if node name is “Date”, store the value as the article date
      • if node name is “Contents”, store the value as the article contents
      • if node name is “Summary”, store the value as the article summary
      • if node name is “Pictures”, process any pictures for this article
      • if node name is “Categories”, process any categories for this article
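
The article steps above might look like this in Python. It is a sketch under assumptions: the node names come from the list above, but the cut-down FeedArticle, the ISO 8601 date format and the sample XML are invented for illustration, and the picture and category handling is left as a placeholder that just keeps the raw child nodes.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class FeedArticle:
    heading: str = ""
    date: Optional[datetime] = None
    contents: str = ""
    summary: str = ""
    pictures: List[ET.Element] = field(default_factory=list)
    categories: List[ET.Element] = field(default_factory=list)


def process_article(node: ET.Element) -> FeedArticle:
    article = FeedArticle()
    for child in node:
        text = (child.text or "").strip()
        if child.tag == "Heading":
            article.heading = text
        elif child.tag == "Date":
            # Assumes an ISO 8601 date string; the real feed format may differ.
            article.date = datetime.fromisoformat(text)
        elif child.tag == "Contents":
            article.contents = text
        elif child.tag == "Summary":
            article.summary = text
        elif child.tag == "Pictures":
            # Placeholder: keep the raw nodes for a picture-processing step.
            article.pictures = list(child)
        elif child.tag == "Categories":
            # Placeholder: keep the raw nodes for a category-processing step.
            article.categories = list(child)
    return article


sample = ET.fromstring(
    "<Article><Heading>Hello</Heading><Date>2010-05-01T09:30:00</Date>"
    "<Summary>Short</Summary><Pictures><Picture/></Pictures></Article>"
)
article = process_article(sample)
```

Unknown node names simply fall through the chain, so extra elements in the feed do not break the parse.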

The picture processing is similar and consists of checking child nodes and either storing attributes (like PhotoTag etc), or processing a group of child images. The resulting FeedPicture objects are added to the FeedArticle:

  • Create a blank FeedPicture
    • While we are processing picture nodes
      • if node name is “Large”, process an image and store in the large FeedImage
      • if node name is “Small”, process an image and store in the small FeedImage
      • if node name is “PhotoTag”, store the value as the picture caption

The image processing is even simpler and consists only of checking child nodes and storing attributes (like Width, Height and URL etc) so I will not bother with pseudo-code.
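
For completeness, here is how the picture and image steps could be sketched in Python. The names "Large", "Small" and "PhotoTag" come from the pseudo-code above; treating URL, Width and Height as child elements, and the cut-down classes themselves, are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedImage:
    url: str = ""
    width: int = 0
    height: int = 0


@dataclass
class FeedPicture:
    caption: str = ""
    large: Optional[FeedImage] = None
    small: Optional[FeedImage] = None


def process_image(node: ET.Element) -> FeedImage:
    # Assumed child element names: URL, Width, Height.
    return FeedImage(
        url=(node.findtext("URL") or "").strip(),
        width=int(node.findtext("Width") or 0),
        height=int(node.findtext("Height") or 0),
    )


def process_picture(node: ET.Element) -> FeedPicture:
    picture = FeedPicture()
    for child in node:
        if child.tag == "Large":
            picture.large = process_image(child)
        elif child.tag == "Small":
            picture.small = process_image(child)
        elif child.tag == "PhotoTag":
            picture.caption = (child.text or "").strip()
    return picture
```

One process_image function serves both sizes, which is the payoff of giving large and thumbnail images the same FeedImage class.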

The categories processing consists of processing a group of child category nodes. The resulting FeedCategory objects are added to the FeedArticle.

The category processing (1 category) consists only of checking child nodes and storing attributes (like ID and the text of the category name) so I will not bother with pseudo-code.
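
A category sketch in Python is equally short. The exact XML shape is an assumption here: this version supposes each category looks like <Category ID="12">Travel</Category>, which may not match the real feed, and the cut-down FeedCategory is for illustration only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class FeedCategory:
    category_number: str = ""
    category_name: str = ""


def process_categories(node: ET.Element) -> List[FeedCategory]:
    # Walk the group node and build one FeedCategory per child category.
    categories = []
    for child in node:
        if child.tag == "Category":  # assumed element name
            categories.append(
                FeedCategory(
                    category_number=child.get("ID", ""),  # assumed attribute
                    category_name=(child.text or "").strip(),
                )
            )
    return categories
```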

Storing the results

After running these parsing steps you wind up with a hierarchy of objects that you can iterate over to run any processing you like. Typically for feed processing you will check each article against the existing entries to see whether it is new or changed. If nothing else, this allows you to spit out statistics on how many articles were added or changed by a given day’s feed.
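
One simple way to get those statistics is to compare a fingerprint of each incoming article against the stored one. The Python below is a sketch of that idea, not the method used in this project: hashing the heading and contents is an assumption, and fingerprint and classify are hypothetical helpers.

```python
import hashlib
from collections import Counter


def fingerprint(heading: str, contents: str) -> str:
    # Hash the fields that matter so a change is cheap to detect.
    return hashlib.sha256(f"{heading}\n{contents}".encode("utf-8")).hexdigest()


def classify(stored: dict, incoming: dict) -> Counter:
    # Both dicts map article number -> fingerprint.
    stats = Counter(new=0, changed=0, unchanged=0)
    for number, current in incoming.items():
        if number not in stored:
            stats["new"] += 1
        elif stored[number] != current:
            stats["changed"] += 1
        else:
            stats["unchanged"] += 1
    return stats
```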

Enough for now. In Part 3 we will look at the merging of the simple article tables, now updated by the previous steps, into the N2 CMS structure. That’s where the fun really begins…

Cheers, Dave

Adding a Newsfeed to N2 CMS – Part 1

Time for something a little more practical…

Adding a 3rd party newsfeed to a CMS is something that many may eventually want to do. Newsfeeds with daily-fresh content can add a lot of SEO value to a website. This basically involves downloading a formatted XML file (e.g. from a company like Adfero) and merging the articles into your CMS, so that they display in your lovely news archives and on your front page (and wherever else you want to see them).

This is the second time I have worked with adding Adfero news feeds to a system, so you can learn from my mistakes without having to make them yourself.

The previous project included a pass-through system, where links were dynamically inserted into the Adfero copy on-the-fly. The links were matched to keywords and phrases. A new URL was provided to clients to pick up the modified newsfeed, now containing hyperlinks embedded into the original articles.

The current project is a bit more straightforward, merging a newsfeed into a CMS system, but builds on my knowledge of the previous project.

The secret to merging News feeds into a CMS is “DO NOT merge the feed directly into the CMS database”. Honestly, there are so many things that can change (possibly even your entire CMS) that you want to keep this process separated into discrete reproducible/testable steps.

Stage 1 processing:

  • Run 1-4 times a day, depending on how lively your newsfeed is.
  • Download the Newsfeed file to a local file.
  • Also allow a specific local file to be processed (so you can restore from XML news archive files)
  • Extract a hierarchy of articles, categories and photos
  • Add new articles, categories and photos to the News tables
  • Update changed articles, categories and photos in the News tables
  • Ignore any completely unchanged articles

The end result of all this is a local database (usually 3 tables) of all past and current articles, ready for you to do with as you wish!
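
The add/update/ignore part of Stage 1 can be sketched against a single table. This Python example uses an in-memory SQLite database purely for illustration; the table layout, column names and store_article helper are all assumptions, and a real system would cover the category and photo tables in the same way.

```python
import sqlite3


def store_article(db, number, heading, contents):
    # Returns "added", "updated" or "ignored" so the caller can keep statistics.
    row = db.execute(
        "SELECT heading, contents FROM articles WHERE article_number = ?",
        (number,),
    ).fetchone()
    if row is None:
        db.execute(
            "INSERT INTO articles (article_number, heading, contents) VALUES (?, ?, ?)",
            (number, heading, contents),
        )
        return "added"
    if row != (heading, contents):
        db.execute(
            "UPDATE articles SET heading = ?, contents = ? WHERE article_number = ?",
            (heading, contents, number),
        )
        return "updated"
    return "ignored"


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE articles (article_number TEXT PRIMARY KEY, heading TEXT, contents TEXT)"
)
results = [
    store_article(db, "A1", "Hello", "Body"),      # first sighting
    store_article(db, "A1", "Hello", "Body"),      # same again
    store_article(db, "A1", "Hello", "New body"),  # edited upstream
]
```

Because the step is keyed on the feed's own article number, running the same feed file twice is harmless, which is what makes restoring from archived XML files safe.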

Stage 2 processing (any time after Stage 1):

  • Always run through the stored archive of all articles (does not take long, even for many 1000s).
  • Find the correct parent for the CMS item (there may be more than one… more about this in Part 2).
  • If an existing CMS item exists for an article, and has changed, update that CMS item.
  • If no existing CMS item exists for an article, create a new CMS item.
  • If an existing CMS item exists for an article, and has not changed, do nothing.
  • If the CMS item is under the wrong parent (e.g. it has been deleted) move it back.
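
The Stage 2 rules boil down to a small decision function per article. Here is a sketch in Python that deliberately avoids any real CMS API: CmsItem, plan_actions and the fingerprint strings are hypothetical stand-ins for whatever your CMS and news tables actually provide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CmsItem:
    parent: str       # where the item currently lives in the CMS tree
    fingerprint: str  # summary of the content last written to the CMS


def plan_actions(article_fingerprint: str, expected_parent: str,
                 item: Optional[CmsItem]) -> List[str]:
    # Decide what Stage 2 should do for one stored article.
    if item is None:
        return ["create"]
    actions = []
    if item.fingerprint != article_fingerprint:
        actions.append("update")
    if item.parent != expected_parent:
        actions.append("move")
    return actions  # an empty list means "do nothing"
```

Keeping the decision separate from the code that touches the CMS makes it easy to do a dry run and list what would change before anything is written.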

Day-to-day-data…

The 3 separate groups of data you will work with are:

  • The XML Feed file
  • The local news database tables
  • The CMS database

Best to take this system one chunk at a time. In part 2 we will look at the XML feed format and describe how you can process it in a systematic way. The techniques used can be applied to many types of XML file.

Cheers, Dave
