Monday, April 4, 2011

Using regex on a Drupal RSS Feed

A brief intro on regex

Regex, or "regular expressions", are used just about everywhere. When you do a search on Google, you are using something in the spirit of a regular expression. When you do a find in a document, you are often using a form of regular expression too.
What is a regular expression, for the uninitiated? Simply put, a regular expression is a concise definition that matches characters, words, and patterns in strings of text and documents. I won't go into a tutorial about "regex" (pronounced "rej-ex"); you can read up on it at Wikipedia if you want to know more.

Testing your expressions

I've used regex for many years and in many forms, and I've always struggled with it. It's kind of obtuse, and I've seen regular expressions for verifying a valid e-mail address that are hundreds of characters long. It can get really complicated really quickly. While experimenting with regex as a parser in the Drupal Feed Importer stack, I ran across a very awesome tool that lets you test out your regular expressions. It's RegExr at gskinner.com, and it's one of the best regex testing tools I've found. It even shows you your grouping matches, does multiline, and lots of other neat stuff.

Install Drupal Modules

To use regex on a news feed import in Drupal, you'll first need the Feeds Module. This module allows you to import news feed items and elements into Drupal nodes. Next you'll need the Feeds Regex Parser. This little module is great if you have a news feed that doesn't have its fields parsed out all nice for you. In my case, I had a long address in one string and I needed to pull the street address, city and state out. With the Feeds Regex Parser, you can import pieces from pattern matches using groupings.
For example, say I have a news feed that looks something like this:
<feed>
<item>
<address>123 Anystreet, Yourtown, CA</address>
</item>
<item>
<address>234 Anotherstreed, Anothertown, MI</address>
</item>
</feed>
And I want to pull out the address part of this without the town and state, and make it the title of the node I create from the feed.
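As a rough sketch of the idea, here is how a single group can capture everything before the first comma of an address string. This is written in Python rather than the PHP the module itself uses, so treat the syntax as illustrative. The pattern and the names street_pattern and street_of are assumptions made for this example, not something taken from the module:

```python
import re

# Sample address strings, copied from the feed above.
addresses = [
    "123 Anystreet, Yourtown, CA",
    "234 Anotherstreed, Anothertown, MI",
]

# Hypothetical pattern: the group captures everything up to the first comma.
street_pattern = re.compile(r"^([^,]+),")


def street_of(address):
    """Return the street part of an address, or None if there is no comma."""
    match = street_pattern.search(address)
    return match.group(1) if match else None


streets = [street_of(a) for a in addresses]
print(streets)
```

Using [^,]+ instead of .* keeps the group from running past the first comma and swallowing the town as well.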

Add New News Feed Importer

First create a new feed in Administer->Site Building->Feed Importers.
Give the feed a name and a description, then set up the feed on this screen:

Click on the basic settings and make sure you have the content type set to "Use standalone form" and that the Minimum refresh period is "Never". You can change the refresh period to your liking once it's all tested and working.

The fetcher should default to "HTTP Fetcher", so you don't need to make any changes here.
Next, click "change" next to the parser, then click "Select" next to the Regex Parser entry:

Once you've chosen the regex parser you can move on to the node processor settings, since the regex parser doesn't have any settings. Pick the content type you want to create with each feed item. I have chosen page, but if you choose feed, or feed item, your post will get created as part of an individual feed that has multiple nodes. I just want one page with no attachments, per news feed item, so I have chosen content type of "Page".

You can of course flavor the above to taste. Next is creating your mapping, which is pretty simple. You just select the only choice from the pulldown on the left, then select the field you want to load it into. Create as many entries as you need. We only need to pull the street address, so we'll create one.
We also need a unique identifier for this item, so we don't keep creating a new document every time we see it in the news feed. I'm using GUID. Notice I've checked the box on the GUID line to make it my unique target after I added it.
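To make the idea of a unique target concrete, here is a minimal sketch in Python (not the Feeds module's own code, which is PHP). It assumes each feed item carries a guid value, which the sample feed above does not show. The function name import_items and the dictionary keys are hypothetical:

```python
def import_items(items, seen_guids):
    """Create a node only for items whose GUID has not been seen before."""
    created = []
    for item in items:
        if item["guid"] in seen_guids:
            continue  # already imported on an earlier run, so skip it
        seen_guids.add(item["guid"])
        created.append(item["title"])
    return created


seen = set()

# First run: one item in the feed, so one node is created.
first_run = import_items([{"guid": "a1", "title": "123 Anystreet"}], seen)

# Second run: the old item is still in the feed, plus one new item.
second_run = import_items(
    [
        {"guid": "a1", "title": "123 Anystreet"},
        {"guid": "b2", "title": "234 Anotherstreed"},
    ],
    seen,
)
print(first_run, second_run)
```

Without a unique target there is nothing to compare against, so every refresh would create the same documents all over again.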

Import Your Data

You're almost there. Next, click on the top-level menu item "Import", then click on the name of the importer you created. You can now enter your regular expressions. I've oversimplified mine, and they most definitely could be better constructed, but I'm not here to discuss elegant regex writing. What I want to demonstrate below is the "Record Context": the pattern that marks out one record, inside which all of your record-chopping regexes will run.
So in the example below, my context shows <item></item> and then I am free to parse out my fields below that.
Lastly, enter the URL for your feed, then hit the import button and you are good to go!!
Anything that is specified in your grouping (in between the parentheses) will be loaded into the field you specified in your mapping.
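The two-stage idea (a record context first, then a field pattern inside each record) can be sketched roughly like this. It is a Python approximation, not the module's PHP, and the two patterns are simplified assumptions written against the sample feed shown earlier:

```python
import re

feed = """<feed>
<item>
<address>123 Anystreet, Yourtown, CA</address>
</item>
<item>
<address>234 Anotherstreed, Anothertown, MI</address>
</item>
</feed>"""

# Record context: one match per <item>...</item>. re.DOTALL lets "." span the
# line breaks inside a record, and the "?" keeps each match to a single item.
context = re.compile(r"<item>.*?</item>", re.DOTALL)

# Field pattern: only the group (the part in parentheses) gets loaded.
title_field = re.compile(r"<address>([^,]*),")


def parse_titles(text):
    """Return the captured group from each record that matches the context."""
    titles = []
    for record in context.findall(text):
        match = title_field.search(record)
        if match:
            titles.append(match.group(1))
    return titles


titles = parse_titles(feed)
print(titles)
```

If the context pattern does not match a whole record, the field patterns have nothing to search in, so it is worth testing the context on its own first.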

Caveats and Patch

You can use preg_match_all switches below, but I was not able to get them to work as expected. Also, you can only have one group, so a regex like the following won't work:
/\$(.*),(.*)/
It will take only whatever is in the first group.
I did write a small patch that concatenates the groups into one string for loading.
In the FeedsREGEXParser.inc file in the module, change the following:
  if (isset($matches[1])) {
    return $matches[1];
  } else {
    return $matches[0];
  }
}
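To be:
  if (isset($matches[1])) {
    // Concatenate every captured group instead of returning only the first.
    $retval = "";
    for ($i = 1; $i < count($matches); $i++) {
      $retval .= $matches[$i];
    }
    return $retval;
  } else {
    // No groups in the pattern: fall back to the whole match.
    return $matches[0];
  }
}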
I'll submit this patch to the creator of the module for possible future inclusion, but in the meantime, it can help you strip commas out of numbers and the like. So for example, something like:
/\$(.*),(.*)/
when applied to the string "$100,000",
will result in the number:
100000
Which can then be happily loaded into any numeric field.
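For anyone who wants to try the idea outside of Drupal, here is a small Python sketch of the same concatenation. It is not the module's code, and the helper name concat_groups is made up for this example:

```python
import re


def concat_groups(pattern, text):
    """Join every captured group of the first match.

    Falls back to the whole match when the pattern has no groups, and
    returns None when the pattern does not match at all.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    groups = match.groups()
    if groups:
        # A group that took no part in the match is None; treat it as "".
        return "".join(g or "" for g in groups)
    return match.group(0)


result = concat_groups(r"\$(.*),(.*)", "$100,000")
print(result)
```

Note that this pattern only steps over one comma. A value such as "$1,000,000" would keep its first comma, because the first group is greedy and runs up to the last comma in the string.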
-MT

4 comments:

jamielee said...

Thanks for the great example. I appreciate it.

I am trying to reproduce your example with little luck, though.

What version of drupal are you using?

Also, I noticed in the xml sample that no tags are present for GUID. Are they implied?

Is standalone mode required?

GeekTravels said...

errr... Drupal 6 :) I haven't tried this yet on 7. There were a couple of issues with the modules on 7.

What problems are you having?

jamielee said...

I am not able to import anything. No errors, just no nodes are imported. I am trying to parse one piece of data from a stream (link), with a similar XML structure to your example.

context = /<item>.*\s.*<\/item>/

variable regex - link
/<link>(.*)<\/link>/

GeekTravels said...

Look at the XML source of your feed and copy and paste two or more records into the RegExr utility linked to above.

Then put your context string in the RegExr (without the slashes, I believe) and test your context to make sure it is getting the entire record.

Your subsequent RegexParsers in Drupal will search within this context. You may also want to use \S or \s depending on whether or not you have new lines in your source record, and you may need multiple \S or \s directives. My Regex is not the best, so you'll have to play around w/ it a bit, but the RegExr tool is invaluable in debugging this process.