A “news feed” is a cool-sounding term that basically means a text file (in a specific format, of course). “Text file” does not sound anywhere near as cool.
Where does my feed come from?
That depends… If you have licensed a commercial news feed, like Adfero, you will have been given a URL pointing to your own personal download.
There are various free feeds available too, in varying formats, but I am focusing on the Adfero format for this example.
The Adfero file is a hierarchical XML file. It simply contains multiple Articles, each with multiple Categories and multiple Photos.
To process this structure, my simplest solution was to create an equivalent class for each major element in the feed file and parse the file hierarchically.
*Note: a FeedPicture represents the picture, a FeedImage represents that picture in a specific size. This way you can process large and thumbnail image details from the feed file the same way.
Each class has enough members to store the incoming XML data and some simple helper getters to return useful items (like caption of 1st image, if any images are present). The classes and members are:
FeedArticle
- NewsArticleId (int) – News article id (primary key in database)
- ArticleNumber (string) – Unique article number from feed
- Heading (string) – Article heading
- Created (DateTime) – Date/Time article was created
- Date (DateTime) – Publication date of article
- Contents (string) – Contents/body of article
- Summary (string) – Summary of article
- Caption (string) – Shortcut: get caption of 1st image
- PictureList (List<FeedPicture>) – List of 1 or more images, each with thumbnail and large
- CategoryList (List<FeedCategory>) – List of categories for this article
- LargeImageUrl (string) – Shortcut: get URL of 1st large image
- LargeImageWidth (int) – Shortcut: get width of 1st large image
- LargeImageHeight (int) – Shortcut: get height of 1st large image
- SmallImageUrl (string) – Shortcut: get URL of 1st thumbnail image
- SmallImageWidth (int) – Shortcut: get width of 1st thumbnail image
- SmallImageHeight (int) – Shortcut: get height of 1st thumbnail image
FeedCategory
- Found (bool) – For internal processing use – was this category found in the database (and for the current article)?
- CategoryId (int) – Unique category id – database PK
- CategoryNumber (string) – Unique category number (from feed)
- CategoryName (string) – Display name of category
FeedPicture
- Portrait (bool) – Is the image taller than it is wide?
- AspectRatio (float) – Aspect ratio of width to height of large image
- PhotoId (int) – Original photo id (from feed)
- LargeImage (FeedImage) – The large image details (if any)
- SmallImage (FeedImage) – The small image details (if any)
- Caption (string) – A text caption for the images
FeedImage
- ImageURL (string) – Image URL
- ImageSize (Size) – Size of image
- Width (int) – Width of image – helper
- Height (int) – Height of image – helper
- AspectRatio (float) – Ratio of width to height
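As a rough sketch, the four classes listed above might look something like this in C#. The member names come from the lists; the fall-back values (null or 0 when an article has no pictures) and the choice to compute Portrait and AspectRatio from the large image are my assumptions, not something the feed dictates.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// One rendition (large or thumbnail) of a picture.
public class FeedImage
{
    public string ImageURL { get; set; }
    public Size ImageSize { get; set; }
    public int Width => ImageSize.Width;
    public int Height => ImageSize.Height;
    // Width divided by height; 0 when the height is not known (assumed default).
    public float AspectRatio => Height == 0 ? 0f : (float)Width / Height;
}

// A picture from the feed, holding its large and thumbnail renditions.
public class FeedPicture
{
    public int PhotoId { get; set; }
    public string Caption { get; set; }
    public FeedImage LargeImage { get; set; }
    public FeedImage SmallImage { get; set; }
    public float AspectRatio => LargeImage?.AspectRatio ?? 0f;
    public bool Portrait => LargeImage != null && LargeImage.Width < LargeImage.Height;
}

public class FeedCategory
{
    public bool Found { get; set; }
    public int CategoryId { get; set; }
    public string CategoryNumber { get; set; }
    public string CategoryName { get; set; }
}

public class FeedArticle
{
    public int NewsArticleId { get; set; }
    public string ArticleNumber { get; set; }
    public string Heading { get; set; }
    public DateTime Created { get; set; }
    public DateTime Date { get; set; }
    public string Contents { get; set; }
    public string Summary { get; set; }
    public List<FeedPicture> PictureList { get; } = new List<FeedPicture>();
    public List<FeedCategory> CategoryList { get; } = new List<FeedCategory>();

    // Shortcut getters: all read the first picture and fall back to
    // null or 0 when the article has no pictures (assumed defaults).
    private FeedPicture FirstPicture => PictureList.Count > 0 ? PictureList[0] : null;
    public string Caption => FirstPicture?.Caption;
    public string LargeImageUrl => FirstPicture?.LargeImage?.ImageURL;
    public int LargeImageWidth => FirstPicture?.LargeImage?.Width ?? 0;
    public int LargeImageHeight => FirstPicture?.LargeImage?.Height ?? 0;
    public string SmallImageUrl => FirstPicture?.SmallImage?.ImageURL;
    public int SmallImageWidth => FirstPicture?.SmallImage?.Width ?? 0;
    public int SmallImageHeight => FirstPicture?.SmallImage?.Height ?? 0;
}
```

Because the shortcut getters never throw on an article without pictures, display code can read article.LargeImageUrl directly and simply check for null.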
Downloading the News feed
The .NET WebClient class is wonderful for downloading content asynchronously. Give its DownloadFileAsync method the URL and a local target filename and it does the download for you. If you attach an event handler to the DownloadFileCompleted event, you are informed when the file has been downloaded. This means you now have an XML file on the server, ready to process like any other local file.
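A minimal sketch of that download step might look like the following. WebClient, DownloadFileAsync and DownloadFileCompleted are the real .NET members; the FeedDownloader class and its DownloadAsync method are hypothetical names of mine, and wrapping the event in a Task is just one convenient way to let the caller wait for completion. Note that WebClient is flagged as obsolete in recent .NET versions, where HttpClient is the suggested replacement.

```csharp
using System;
using System.ComponentModel;
using System.Net;
using System.Threading.Tasks;

public static class FeedDownloader
{
    // Starts an asynchronous download of the feed to localPath and returns
    // a Task that completes when DownloadFileCompleted fires.
    public static Task DownloadAsync(Uri feedUri, string localPath)
    {
        var done = new TaskCompletionSource<bool>();
        var client = new WebClient();

        client.DownloadFileCompleted += (object sender, AsyncCompletedEventArgs e) =>
        {
            client.Dispose();
            if (e.Cancelled) done.TrySetCanceled();
            else if (e.Error != null) done.TrySetException(e.Error);
            else done.TrySetResult(true);   // the file is now on disk
        };

        client.DownloadFileAsync(feedUri, localPath);
        return done.Task;
    }
}
```

Checking e.Error before touching the file matters: the completed event also fires when the download fails, in which case the local file may be missing or incomplete.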
Eating the feed
The basic principle of parsing this XML feed is to follow the hierarchical structure of the feed.
- Create an XmlTextReader
- While there are more article nodes
  - Add each processed article to a list of all articles
- Store the articles in the list
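The outer loop could be sketched roughly as below. The element name "Article" is an assumption (check it against your own feed file), FeedReader and ProcessArticle are hypothetical names, and the FeedArticle class is cut down to a single member to keep the example short. ProcessArticle is only a stand-in that grabs the heading.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Cut-down stand-in for the full FeedArticle class.
public class FeedArticle
{
    public string Heading { get; set; }
}

public static class FeedReader
{
    public static List<FeedArticle> ReadArticles(TextReader input)
    {
        var articles = new List<FeedArticle>();
        using (var reader = new XmlTextReader(input))
        {
            // "Article" is an assumed element name.
            while (reader.ReadToFollowing("Article"))
            {
                // ReadSubtree confines the per-article code to this one
                // article; when the sub-reader is closed, the main reader
                // is left at the end of the article, ready for the next one.
                using (XmlReader articleReader = reader.ReadSubtree())
                {
                    articles.Add(ProcessArticle(articleReader));
                }
            }
        }
        return articles;
    }

    // Stand-in for the real article processing: reads only the heading.
    private static FeedArticle ProcessArticle(XmlReader articleReader)
    {
        var article = new FeedArticle();
        if (articleReader.ReadToFollowing("Heading"))
        {
            article.Heading = articleReader.ReadElementContentAsString();
        }
        return article;
    }
}
```

Because the reader streams through the file, only one article's worth of XML is being examined at any moment, however large the feed is.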
The article processing consists of checking child nodes and either storing attributes (like Heading, Contents, etc.) or processing a group of child pictures or child categories. The resulting FeedArticle object is returned:
- Create a blank FeedArticle
- While we are processing nodes of the current article
  - if node name is “Heading”, store the value as the article heading
  - if node name is “Date”, store the value as the article date
  - if node name is “Contents”, store the value as the article contents
  - if node name is “Summary”, store the value as the article summary
  - if node name is “Pictures”, process any pictures for this article
  - if node name is “Categories”, process any categories for this article
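Here is one way the article steps might translate into C#, as a sketch rather than the definitive version. The node names are the ones listed above; ArticleParser is a hypothetical name, the FeedArticle class is cut down to the four simple members, and I am assuming the date text is in a format DateTime.Parse accepts. The Pictures and Categories groups are simply skipped here to keep the example short.

```csharp
using System;
using System.Globalization;
using System.Xml;

// Cut-down stand-in for the full FeedArticle class.
public class FeedArticle
{
    public string Heading { get; set; }
    public DateTime Date { get; set; }
    public string Contents { get; set; }
    public string Summary { get; set; }
}

public static class ArticleParser
{
    // Expects the reader on (or just before) the article's start tag and
    // leaves it just after the matching end tag.
    public static FeedArticle ParseArticle(XmlReader r)
    {
        var article = new FeedArticle();
        r.MoveToContent();
        if (r.IsEmptyElement) { r.Read(); return article; }
        r.ReadStartElement();                       // step inside the article

        while (r.NodeType != XmlNodeType.EndElement && !r.EOF)
        {
            if (r.NodeType != XmlNodeType.Element) { r.Read(); continue; }

            // Every branch consumes the whole child element, so the reader
            // is already on the next node and no extra Read() is needed.
            switch (r.Name)
            {
                case "Heading":
                    article.Heading = r.ReadElementContentAsString();
                    break;
                case "Date":
                    // Assumes a date format that DateTime.Parse understands.
                    article.Date = DateTime.Parse(
                        r.ReadElementContentAsString(), CultureInfo.InvariantCulture);
                    break;
                case "Contents":
                    article.Contents = r.ReadElementContentAsString();
                    break;
                case "Summary":
                    article.Summary = r.ReadElementContentAsString();
                    break;
                default:
                    // Pictures, Categories and unknown nodes: skipped in this sketch.
                    r.Skip();
                    break;
            }
        }
        r.Read();                                   // step past the end tag
        return article;
    }
}
```

The detail to watch is that ReadElementContentAsString and Skip both advance the reader past the element they handle. Calling Read() again after them is a classic way to miss every second node when the feed has no whitespace between elements.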
The picture processing is similar and consists of checking child nodes and either storing attributes (like PhotoTag, etc.) or processing a group of child images. The resulting FeedPicture objects are added to the FeedArticle:
- Create a blank FeedPicture
- While we are processing picture nodes
  - if node name is “Large”, process an image and store in the large FeedImage
  - if node name is “Small”, process an image and store in the small FeedImage
  - if node name is “PhotoTag”, store the value as the picture caption
The image processing is even simpler and consists only of checking child nodes and storing attributes (like Width, Height, URL, etc.), so I will not bother with pseudo-code.
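For completeness, the picture and image steps might be sketched like this. "Large", "Small" and "PhotoTag" are the node names from the pseudo-code; I am assuming that Width, Height and URL are child elements of each image node, which you should verify against your own feed. PictureParser is a hypothetical name and the two classes are cut down to the members used here.

```csharp
using System.Drawing;
using System.Xml;

// Cut-down stand-ins for the full classes.
public class FeedImage
{
    public string ImageURL { get; set; }
    public Size ImageSize { get; set; }
}

public class FeedPicture
{
    public string Caption { get; set; }
    public FeedImage LargeImage { get; set; }
    public FeedImage SmallImage { get; set; }
}

public static class PictureParser
{
    // Both methods expect the reader on a start tag and leave it just
    // after the matching end tag.
    public static FeedPicture ParsePicture(XmlReader r)
    {
        var picture = new FeedPicture();
        if (r.IsEmptyElement) { r.Read(); return picture; }
        r.ReadStartElement();
        while (r.NodeType != XmlNodeType.EndElement && !r.EOF)
        {
            if (r.NodeType != XmlNodeType.Element) { r.Read(); continue; }
            switch (r.Name)
            {
                case "Large": picture.LargeImage = ParseImage(r); break;
                case "Small": picture.SmallImage = ParseImage(r); break;
                case "PhotoTag": picture.Caption = r.ReadElementContentAsString(); break;
                default: r.Skip(); break;
            }
        }
        r.Read();
        return picture;
    }

    // The same code handles large and thumbnail images.
    public static FeedImage ParseImage(XmlReader r)
    {
        var image = new FeedImage();
        int width = 0, height = 0;
        if (r.IsEmptyElement) { r.Read(); return image; }
        r.ReadStartElement();
        while (r.NodeType != XmlNodeType.EndElement && !r.EOF)
        {
            if (r.NodeType != XmlNodeType.Element) { r.Read(); continue; }
            switch (r.Name)
            {
                // Assumed child element names.
                case "Width": width = r.ReadElementContentAsInt(); break;
                case "Height": height = r.ReadElementContentAsInt(); break;
                case "URL": image.ImageURL = r.ReadElementContentAsString(); break;
                default: r.Skip(); break;
            }
        }
        r.Read();
        image.ImageSize = new Size(width, height);
        return image;
    }
}
```

This is where splitting FeedPicture from FeedImage pays off: ParseImage is written once and called for both sizes.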
The categories processing consists of processing a group of child category nodes. The resulting FeedCategory objects are added to the FeedArticle.
The category processing (1 category) consists only of checking child nodes and storing attributes (like ID and the text of the category name), so I will not bother with pseudo-code.
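A sketch of the category handling follows, with heavier assumptions than the earlier ones: I am guessing that each category is a "Category" element carrying an "ID" XML attribute, with the category name as its text. If your feed stores the ID as a child element instead, the reading code changes accordingly. CategoryParser is a hypothetical name and FeedCategory is cut down to two members.

```csharp
using System.Collections.Generic;
using System.Xml;

// Cut-down stand-in for the full FeedCategory class.
public class FeedCategory
{
    public string CategoryNumber { get; set; }
    public string CategoryName { get; set; }
}

public static class CategoryParser
{
    // Expects the reader on the start tag of the categories group and
    // leaves it just after the matching end tag.
    public static List<FeedCategory> ParseCategories(XmlReader r)
    {
        var categories = new List<FeedCategory>();
        if (r.IsEmptyElement) { r.Read(); return categories; }
        r.ReadStartElement();
        while (r.NodeType != XmlNodeType.EndElement && !r.EOF)
        {
            if (r.NodeType != XmlNodeType.Element) r.Read();
            else if (r.Name == "Category") categories.Add(ParseCategory(r));
            else r.Skip();
        }
        r.Read();
        return categories;
    }

    // Assumed shape: <Category ID="12">Finance</Category>
    public static FeedCategory ParseCategory(XmlReader r)
    {
        var category = new FeedCategory();
        // Read the attribute first: reading the text moves the reader
        // off the start tag, after which the attribute is out of reach.
        category.CategoryNumber = r.GetAttribute("ID");
        category.CategoryName = r.ReadElementContentAsString();
        return category;
    }
}
```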
Storing the results
After running these parsing steps you wind up with a hierarchy of objects that you can iterate over to run any processing you like. Typically for feed processing you will look up each incoming article among the existing entries to see whether it is new or changed. If nothing else, this allows you to spit out statistics on how many articles were added or changed by a given day’s feed.
Enough for now. In Part 3 we will look at merging the simple article tables, now updated by the previous steps, into the N2 CMS structure. That’s where the fun really begins…
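The new-or-changed check could be sketched as below. To keep it self-contained, a dictionary keyed by ArticleNumber stands in for the database lookup, and "changed" is taken to mean a different heading or contents. Both are assumptions of mine, as are the FeedMerger and FeedStats names.

```csharp
using System.Collections.Generic;

// Cut-down stand-in for the full FeedArticle class.
public class FeedArticle
{
    public string ArticleNumber { get; set; }
    public string Heading { get; set; }
    public string Contents { get; set; }
}

public class FeedStats
{
    public int Added { get; set; }
    public int Changed { get; set; }
    public int Unchanged { get; set; }
}

public static class FeedMerger
{
    // existing: articles already stored, keyed by ArticleNumber
    // (an in-memory stand-in for the real database lookup).
    public static FeedStats Merge(
        IEnumerable<FeedArticle> incoming,
        IDictionary<string, FeedArticle> existing)
    {
        var stats = new FeedStats();
        foreach (FeedArticle article in incoming)
        {
            FeedArticle stored;
            if (!existing.TryGetValue(article.ArticleNumber, out stored))
            {
                existing[article.ArticleNumber] = article;   // new article
                stats.Added++;
            }
            else if (stored.Heading != article.Heading ||
                     stored.Contents != article.Contents)
            {
                existing[article.ArticleNumber] = article;   // changed article
                stats.Changed++;
            }
            else
            {
                stats.Unchanged++;
            }
        }
        return stats;
    }
}
```

Using the feed's own ArticleNumber as the key, rather than the database id, is what lets the same article be recognised again in tomorrow's feed.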