
Periodic Table of Elements
Element Reference
Main section
<feed>
General feed information
<info>
<info.name>
<info.description>
<info.author>
<info.created>
Scheduling section
<scheduling>
<schedule>
<hours>
<minutes>
<days>
<months>
<weekdays>
<scheduling.content_id>
Fetch section
<fetch>
<fetch.location>
<fetch.authentication>
Input section
<input>
<section>
<source>
<actions>
<content_id>
<parser>
<filter>
Output
<output>
<output.instance>
<output.map>
Parsers and Filters
Parsers
Regex Parser
XML Parser
Filters
Paragraph_breaks
Regex
Remove_font_tags
Remove_tags
Trim_whitespace
Testing and Uploading Configuration Files
DTD
Sample Configuration Files
| Element | Contents | Attributes | Allowed In |
|---|---|---|---|
| <feed> | info, scheduling, fetch, input, output | version (required) | n/a |
| <info> | info.name (required), info.description, info.author, info.created, info.categories (required) | none | feed |
| <info.name> | Plain text | n/a | info |
| <info.description> | Plain text | n/a | info |
| <info.author> | Plain text | n/a | info |
| <info.created> | Plain text | n/a | info |
| <scheduling> | schedule (1+), scheduling.content_id (1+) | n/a | feed |
| <schedule> | hours (multiple allowed), minutes (multiple allowed), days (multiple allowed), months (multiple allowed), weekdays (multiple allowed) | label (required) | scheduling |
| <hours> | Plain text | n/a | schedule |
| <minutes> | Plain text | n/a | schedule |
| <days> | Plain text | n/a | schedule |
| <months> | Plain text | n/a | schedule |
| <weekdays> | Plain text | n/a | schedule |
| <scheduling.content_id> | n/a | name (required), use_schedule (required), expires_after, last_updated | scheduling |
| <fetch> | fetch.location (required), fetch.authentication | type | feed |
| <fetch.location> | n/a | hostname, starting_dir | fetch |
| <fetch.authentication> | n/a | username, password | fetch |
| <input> | section (1+) | n/a | feed |
| <section> | content_id (1+), source (1+), actions (multiple allowed) | description, new_data_item, label, use | input, parser, source |
| <source> | content_id (1+), section (1+) | type (required), field, value | section |
| <actions> | content_id (1+), parser or filter or section (1+) | label, use | input |
| <content_id> | n/a | name (required) | section, source, actions, parser, filter |
| <parser> | content_id (1+), section (1+) | type (required), input_field, output_field (required), max_loops, args | actions |
| <filter> | content_id (1+), section (1+) | type (required), fields (required), max_loops, args | actions |
| <output> | output.instance (1+) | n/a | feed |
| <output.instance> | output.map (1+) | type, args | output |
| <output.map> | n/a | source (required), to (required) | output.instance |
In the context of a DTD (Document Type Definition), an element's contents are
tags that are nested between the current element's opening and closing tags.
Attributes are values defined within the tag itself. An empty tag (a tag that
has attributes but no contents) can be closed within the opening tag by adding
a slash at the end, i.e., <tag value="example" />.
Top-level element defining the entire feed configuration. Each child element is
optional, but the <input> and <output> elements may be used only in conjunction
with the <fetch> element. If <fetch> is not present, <input> and <output> are
invalid.
Contents: <info>, <scheduling>, <fetch>, <input>, <output>
Attributes:
- version - version of the configuration file DTD (required)
Usage:
<feed version="0.1">...</feed>
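As a rough sketch of how these pieces nest, the skeleton below lists the five child elements in the order given by the DTD content model (info, scheduling, then fetch followed by input and output). The ellipses are placeholders for child elements, not literal content.

```xml
<feed version="0.1">
  <!-- Optional general information -->
  <info>...</info>
  <!-- Optional scheduling information -->
  <scheduling>...</scheduling>
  <!-- input and output are valid only when fetch is present -->
  <fetch type="http">...</fetch>
  <input>...</input>
  <output>...</output>
</feed>
```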
These settings allow you to specify general information about the feed: name, description, author, etc.
Top-level tag containing general feed information.
Contents: <info.name> (required), <info.description>, <info.author>, <info.created>, <info.categories> (required)
Attributes: n/a
Allowed Inside: <feed>
Usage:
<info>...</info>
Specifies the name of the content provider.
Contents: plain text
Attributes: n/a
Allowed Inside: <info>
Usage:
<info.name>Content Provider Name</info.name>
Brief feed description.
Contents: plain text
Attributes: n/a
Allowed Inside: <info>
Usage:
<info.description>Feed from ContentProvider.com</info.description>
Author of the current configuration file.
Contents: plain text
Attributes: n/a
Allowed Inside: <info>
Usage:
<info.author>John P. Author</info.author>
Configuration file creation date. The only date format currently supported is YYYYMMDD.
Contents: plain text
Attributes: n/a
Allowed Inside: <info>
Usage:
<info.created>20000804</info.created>
Contains information about the scheduling of the individual content_ids of the feed. Includes info about article expiration within a category, execution schedule of the category and when the category was last updated.
Top-level tag containing scheduling information.
Contents: <schedule> (one or more), <scheduling.content_id> (one or more)
Attributes: n/a
Allowed Inside: <feed>
Usage:
<scheduling>...</scheduling>
Scheduling for automated scrape. Several schedules can be set up and saved with different names, in the event that different feeds need to be dealt with at different times.
The following is an example of a schedule block:
<schedule label="Thursdays">
  <hours>3</hours>
  <minutes>00</minutes>
  <days>*</days>
  <months>*</months>
  <weekdays>Thursday</weekdays>
</schedule>
Contents: <hours> (multiple allowed), <minutes> (multiple allowed), <days> (multiple allowed), <months> (multiple allowed), <weekdays> (multiple allowed)
Attributes:
- label - name for specified schedule block (required)
Allowed Inside: <scheduling>
Usage:
<schedule label="daily">...</schedule>
Hourly scheduling. Accepted values are 0-23. Wildcards (*) are allowed.
Contents: Plain text
Attributes: n/a
Allowed Inside: <schedule>
Usage:
<hours>3</hours>
Minute to minute scheduling. Accepted values are 00-59. Wildcards (*) are allowed.
Contents: Plain text
Attributes: n/a
Allowed Inside: <schedule>
Usage:
<minutes>00</minutes>
Daily scheduling. Accepted values are 1-31. Wildcards (*) are allowed.
Contents: Plain text
Attributes: n/a
Allowed Inside: <schedule>
Usage:
<days>*</days>
Monthly scheduling. Accepted values are 1-12. Wildcards (*) are allowed.
Contents: Plain text
Attributes: n/a
Allowed Inside: <schedule>
Usage:
<months>*</months>
Weekly scheduling. Accepted values are Monday-Sunday. Wildcards (*) are allowed.
Contents: Plain text
Attributes: n/a
Allowed Inside: <schedule>
Usage:
<weekdays>Tuesday</weekdays>
Contains information about the feed, including the feed ID and expiration
time. The use_schedule value should correspond to the label on
the appropriate <schedule> block.
Contents: none
Attributes:
- name - iSyndicate feed name (required)
- use_schedule - name of schedule to use (required)
- expires_after - (future use)
- last_updated - (future use)
Allowed Inside: <scheduling>
Usage:
<scheduling.content_id name="some_feed.ft" use_schedule="weekdays" />
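To make the link between use_schedule and a schedule label concrete, here is a minimal sketch of a complete scheduling block. It reuses the "Thursdays" schedule and the some_feed.ft name from the examples above. The reading of the schedule as "03:00 every Thursday" assumes cron-like semantics, which this reference does not state explicitly.

```xml
<scheduling>
  <!-- Intended meaning (assuming cron-like semantics): 03:00 every Thursday -->
  <schedule label="Thursdays">
    <hours>3</hours>
    <minutes>00</minutes>
    <days>*</days>
    <months>*</months>
    <weekdays>Thursday</weekdays>
  </schedule>
  <!-- use_schedule refers to the label of the schedule block above -->
  <scheduling.content_id name="some_feed.ft" use_schedule="Thursdays" />
</scheduling>
```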
Defines the fetching of data via various means (http, ftp, email, local file, etc.). Not all elements are required for all feed types.
Top-level tag containing fetch information.
Contents: <fetch.location> (required), <fetch.authentication>
Attributes:
- type - ftp, http, or file (required)
Allowed Inside: <feed>
Usage:
<fetch type="http">...</fetch>
Specifies the location from which feed data will be scraped.
Contents: none
Attributes:
- hostname - feed site hostname (required)
- starting_dir - starting directory for feed scrape (optional)
Allowed Inside: <fetch>
Usage:
<fetch.location hostname="http://www.contentprovider.com" starting_dir="/syndication/" />
Username and password information, if authentication is required.
Contents: none
Attributes:
- username (required)
- password (required)
Allowed Inside: <fetch>
Usage:
<fetch.authentication username="username" password="password" />
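Putting the three fetch elements together, a minimal sketch might look like the following. The hostname, directory, and credentials are the placeholder values from the usage lines above. fetch.location is listed first because the DTD declares it before fetch.authentication.

```xml
<fetch type="http">
  <!-- Where the data comes from; the attribute is hostname, not host -->
  <fetch.location hostname="http://www.contentprovider.com" starting_dir="/syndication/" />
  <!-- Only needed when the source requires a login -->
  <fetch.authentication username="username" password="password" />
</fetch>
```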
Processing for feed input. These elements allow you to define blocks of information and perform various operations on them.
The following is an example of an input block.
<input>
  <section description="get the xml file">
    <content_id name="nytimes.books.hl" />
    <source type="url" value="wireless_books.xml">
      <actions label="do for each content_id">
        <parser type="xml" output_field="article" args="tags=Article">
          <section new_data_item="article">
            <actions>
              <parser type="xml" output_field="headline" args="tags=Headline"/>
              <parser type="xml" output_field="url" args="tags=URL"/>
            </actions>
          </section>
        </parser>
      </actions>
    </source>
  </section>
  <section>
    <content_id name="nytimes.business.hl"/>
    <source type="url" value="wireless_business.xml"/>
    <actions use="do for each content_id"/>
  </section>
</input>
Top-level element containing input information.
Contents: <section> (multiple permitted)
Attributes: n/a
Allowed Inside: <feed>, <source>
Usage:
<input>...</input>
Enables logical grouping of sub-tags. The 'label' attribute allows the section to be marked for re-use later in the configuration file. This is useful if the same operations must be performed on a number of different feeds from the same source.
Contents: <content_id> (multiple allowed), <source> (multiple allowed), <actions>
Attributes:
- description - brief description (required)
- new_data_item - assign name to current data and treat as a new section
- label - name for section to be re-used
- use - name of section to re-use
Allowed Inside: <input>, <parser>
Usage:
<section description="Get index page">...</section>
Specifies the type of feed source (url or field).
Contents: <content_id> (multiple allowed)
Attributes:
- type - url or field (required)
- value - location of data source
Allowed Inside: <section>
Usage:
<source type="field" field="title" value="Downloads" />
Groups parsing and filter actions together into a logical block, which can be
labeled for re-use with the label attribute. The use
attribute instructs the engine to use the specified block.
The example below illustrates how to re-use an actions block:
<stuff>
  <actions label="do_useful_stuff">
    <parser.../>
    <filter.../>
  </actions>
</stuff>
<more_stuff>
  <actions use="do_useful_stuff" />
</more_stuff>
Contents: <content_id> (multiple allowed), <parser> or <filter> or <section> (multiple allowed)
Attributes:
- label - specifies a name for the block to be saved with
- use - name of block to re-use
Allowed Inside: <input>
Usage:
<actions start_article="yes">...</actions>
Specifies the feed ID.
Contents: none
Attributes:
- name (required)
Allowed Inside: <section>, <source>, <actions>, <parser>, <filter>
Usage:
<content_id name="feature_story_news.global.ft" />
There are currently two kinds of parsers: an XML parser and a regular expression parser. These parsers extract information from content and return the extracted information.
Contents: <content_id> (multiple allowed), <section> (multiple allowed)
Attributes:
- type - XML or Regex (required)
- input_field - field from which to retrieve data
- output_field - field for output data (required)
- max_loops - maximum number of times to run the parser on each piece of information
- args - regex (Regex parser), tags (XML parser), attributes (XML parser)
Allowed Inside: <actions>
Usage:
<parser type="regex" max_loops="20" input_field="INPUT" output_field="article" args="regex=(.*?\n{2,});destructive=yes">...</parser>
Content filtering utilities. Only one filter can be used per element; to use
multiple filters, specify each one in its own element. Currently only text
filtering utilities have been implemented. Future implementations will include
filters for images, HTML, and other assorted types. To specify the type of
filter being used (text, image, etc.), preface it with the type, e.g.,
type="text->paragraph_breaks".
Contents: <content_id> (multiple allowed), <section> (multiple allowed)
Attributes:
- type - paragraph_breaks, regex, remove_font_tags, remove_tags, or trim_whitespace (required)
- fields - field for output data (required)
- max_loops - maximum number of times to run the filter on each piece of information
- args - see Filters for more info.
Allowed Inside: <actions>
Usage:
<filter type="Text->Remove_tags" fields="body" args="modifiers=gi" max_loops="10">...</filter>
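Because each filter element carries exactly one filter, applying several filters to the same field means listing several elements. The sketch below chains three of the text filters documented in this reference on a field named body. That the engine applies them in document order is an assumption, not something this reference states.

```xml
<actions>
  <!-- One filter per element; assumed to run top to bottom -->
  <filter type="text->remove_font_tags" fields="body" />
  <filter type="text->remove_tags" fields="body" args="tags=td" />
  <filter type="text->trim_whitespace" fields="body" />
</actions>
```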
Defines how to output information.
Top-level tag containing output information.
Contents: <output.instance> (one or more)
Attributes: n/a
Allowed Inside: <feed>
Usage:
<output>...</output>
Specifies the output type and contains tags specifying any additional output information.
Contents: <output.map> (one or more)
Attributes:
- type - file, Network_articles, or Express_articles
- args - additional arguments (future use)
Allowed Inside: <output>
Usage:
<output.instance type="Network_articles">...</output.instance>
Maps internal fields to output fields.
Contents: none
Attributes:
- source - source of output field
- to - field to output data to
Allowed Inside: <output.instance>
Usage:
<output.map source="article->title" to="title" />
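For orientation, a complete output block might be sketched as follows. The title mapping is taken from the usage line above. The second mapping (article->body to body) is a hypothetical addition for illustration and assumes that a body field was produced during input processing.

```xml
<output>
  <output.instance type="Network_articles">
    <!-- Mapping from the usage example -->
    <output.map source="article->title" to="title" />
    <!-- Hypothetical second mapping; assumes a "body" field exists -->
    <output.map source="article->body" to="body" />
  </output.instance>
</output>
```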
The regex parser searches for regular expressions, whether matching or substitution, and returns the results. Important: characters reserved in XML must be replaced with the corresponding HTML/XML character entity.
Parser Arguments:
- regex - regular expression to search for
Usage:
<parser type="regex" max_loops="20" input_field="INPUT" output_field="article" args="regex=(.*?\n{2,});destructive=yes">...</parser>
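To illustrate the entity rule, suppose the intended regular expression is <TITLE>(.*?)</TITLE>. Inside the args attribute, each < becomes &lt; and each > becomes &gt;. Likewise, a literal & would be written &amp; and a double quote &quot;. The field names INPUT and title below are illustrative only.

```xml
<!-- Intended regex: <TITLE>(.*?)</TITLE> -->
<parser type="regex" input_field="INPUT" output_field="title"
        args="regex=&lt;TITLE&gt;(.*?)&lt;/TITLE&gt;" />
```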
The XML parser searches data for the specified XML tags, and attributes within the specified tags.
Parser Arguments:
- tags - specific XML tags to search for
- attributes - attributes to search for within the specified tags
Usage:
<parser type="xml" output_field="feed_name" args="tags=feed;attributes=name" />
Ensures articles are unique based on specified criteria.
Filter Arguments:
- fields - determines which article fields should be compared for uniqueness. Example: We have 20 articles, and we want to make sure that none have the same title. We specify fields="title". When the filter finds an article with the title 'News headlines', it discards any additional articles with the same title.
Usage:
<filter type="Text->Article_uniqueness" fields="title" />
Fixes paragraph breaks in text; \n\n gets replaced with <P>.
Filter Arguments:
- break - allows specification of something other than \n\n as the delimiter. For example, to break when there are 3 or more carriage returns or newlines in a row, use "break=[\r\n]{3,}".
- modifiers - can be any of Perl's regular expression modifiers ('s', 'i', 'g', etc.) with their corresponding functionality. Many can be combined. Example: to filter all the paragraph breaks in a piece of text, use the 'g' (global) modifier.
Usage:
<filter type="Text->Paragraph_breaks" fields="body" args="break=[\r\n]{3,};modifiers=gi" />
Regular expression based filtering.
Filter Arguments:
- regex - specifies the regular expression to apply to the input. Can be any valid Perl RE.
Usage:
<filter type="Text->Regex" fields="body" args="regex=s#[\n\r]{2,}#\n&lt;P&gt;\n#sgi" />
Removes <FONT...>, </FONT> tags from text passed to it.
Filter arguments: n/a
Usage:
<filter type="text->remove_font_tags" fields="all" />
Removes tags from text.
Filter arguments:
- tags - tags that should be removed. Both start and end tags are removed. Currently only one type of tag can be specified per filter: "tags=td,tr" will not work. Example: "tags=td" will get rid of all <TD...> and </TD...> tags.
- modifiers - can be any of Perl's regular expression modifiers ('s', 'i', 'g', etc.) with their corresponding functionality. Many of them can be combined. Example: you want to filter all the paragraph breaks in a piece of text. Use the 'g' (global) modifier.
Usage:
<filter type="text->remove_tags" fields="all" args="tags=td" />
Removes excess whitespace from text
Filter arguments: n/a
Usage:
<filter type="text->trim_whitespace" fields="all" args="modifiers=gi" />
All commands to test and upload configuration files must be issued as the
dbadmin user on scrapestage. Everyone must have the proper
sudo privileges. Information about sudo can be found
here. If you do not
have the correct privileges, see
Mattias.
This section is necessarily incomplete, because this environment will change with time.
Testing a configuration file currently consists of running the feed with as
much error info turned on as possible. For information on how to use the
scrape command, just type scrape at the command
prompt:
[mbottner@alamo ~]$ scrape
Usage : scrape [-v[v[v]]|-q|-h] file
file name of the configuration file for the feed you want to run
-h this message
-q run the feed quietly (i.e. only fatal errors are reported)
-v determines how much information the feed should return.
For more info, use more 'v's (i.e. -vv or -vvv)
To test a configuration file, make your telnet window at least 120 columns wide and type the following command:
[mbottner@scrapestage network]$ sudo -u dbadmin scrape -vvv config_file
Note that the command is being run as dbadmin by virtue of the
sudo -u dbadmin command.
Much information will spew forth from the scrape engine onto your screen, and you will be overwhelmed. If this is the first time you're testing the configuration file, you will probably encounter an error message, as in the following example:
[mbottner@scrapestage network]$ sudo -u dbadmin scrape -vvv config_file
Using built in validity checks
Config file error: Unrecognized attribute 'host' in tag 'fetch.location':
|
| <fetch type="http">
| <fetch.location host="http://www.channel2.co.uk" starting_dir="/xml" />
|====================^
| </fetch>
This means there is an error in the configuration file that the engine could catch, and tells you where the error is. Occasionally the error message will point you to the wrong place, which can be frustrating, but this should be rare. Once you have corrected the error, rerun the feed. If there are more errors, keep fixing them until it works. Eventually you will see output that looks something like this:
[mbottner@scrapestage network]$ sudo -u dbadmin scrape -vvv bridge_news
Using built in validity checks
Main scrape engine started | |
| |
Initializing fetch system based on | |
parameters from config file | |
| |
Fetch system configuration: | |
| |
Type : http | <fetch type="http"> |
| |
Hostname : http://www.bridge.com | <fetch.location hostname="http://www. |
StartingDir : bio | bridge.com" starting_dir="bio"/> |
| |
Starting input processing... | <input> |
| |
Executing section | <section description="Get the |
| articles out of the index page"> |
| |
Executing source | <source type="url" value="biostory. |
| asp?story=8882&fixed=no& |
| newwindow=false"/> |
| |
Fetching data from url (http://www. | |
bridge.com/bio/biostory.asp?story=8882 | |
&fixed=no&newwindow=false) | |
....
All of this output is generated as the engine runs the feed. It provides information about what is actually going on behind the scenes, with the commands being executed in the left column, and the corresponding XML from the configuration file in the center column. This information will be more detailed in future releases.
Note: There are currently many issues with debugging. The output section of the feeds does not work on scrapestage due to several missing tables in the database. Mattias will be working with ops to synchronize the scrapestage database on a regular basis.
Once the configuration file is running properly, it is ready to be uploaded to the production system.
Uploading is fairly simple. Using the command push_feed, the
specified configuration file is copied from the staging machine (scrapestage)
to the production machine (alamo) and inserted into the crontab. If the upload
fails, a message will alert you and specify the reason for the failure. If the
error message is unclear, contact
Mattias for help.
For help on how to use the push_feed command, type the name of
the command:
[mbottner@scrapestage express]$ push_feed
Usage : /usr/local/bin/push_feed [-v[v[v]]|-q|-h] file
file name of the configuration file you want to push live
-h this message
-q push the feed quietly (i.e. only fatal errors are reported)
-v determines how much information the push script should return.
For more info, use more 'v's (e.g. -vv or -vvv)
Please note that when pushing a feed into production you must specify the feed name in relation to the config directory. For example:
[mbottner@scrapestage network]$ sudo -u dbadmin push_feed -vvv network/test
File '/usr/local/ft/config/network/test' successfully transferred to alamo
Even if you are already in the directory
/usr/local/ft/config/network,
you must specify push_feed -vvv network/test. In future releases
this is likely to change to behave more like other commands.
In this example, the push was successful, meaning that the configuration file was copied over to alamo and installed in the crontab.
This is a preliminary DTD for feed configuration files. Once the system is set
up, the DTD should be able to change depending on available options for filters
and modules in the system. For example, if a new text filter is added to the
system, the DTD would reflect its availability as a filtering option.
If the module hierarchy is set up properly, it should just be a matter of
dropping a module into the proper directory to make it available (assuming that
it has implemented its interfaces correctly). The system should be able to see
it by querying all the modules in the directory.
#-------------------------------------------------------------------------------
# Main section
#-------------------------------------------------------------------------------
<!ELEMENT feed (info?, scheduling?, (fetch, input?, output?)?) >
<!ATTLIST feed version
CDATA #REQUIRED >
#-------------------------------------------------------------------------------
# General feed information:
# Section of DTD which contains general information about the feed: name,
# description, author, etc.
#-------------------------------------------------------------------------------
<!ELEMENT info (info.name, info.description?, info.author?, info.created?,
info.categories)>
<!ELEMENT info.name (#PCDATA)>
<!ELEMENT info.description (#PCDATA)>
<!ELEMENT info.author (#PCDATA)>
<!ELEMENT info.created (#PCDATA)>
#-------------------------------------------------------------------------------
# Scheduling section
# Contains information about the scheduling of the individual content_ids of
# the feed. Includes info about article expiration within a category, execution
# schedule of the category and when the category was last updated.
#-------------------------------------------------------------------------------
<!ELEMENT scheduling (schedule*, scheduling.content_id*)>
<!ELEMENT schedule (hours?, minutes?, days?, months?, weekdays?)>
<!ELEMENT scheduling.content_id EMPTY>
<!ELEMENT hours (#PCDATA)>
<!ELEMENT minutes (#PCDATA)>
<!ELEMENT days (#PCDATA)>
<!ELEMENT months (#PCDATA)>
<!ELEMENT weekdays (#PCDATA)>
<!ATTLIST schedule
label CDATA #REQUIRED >
<!ATTLIST scheduling.content_id
name CDATA #REQUIRED
use_schedule CDATA #REQUIRED
expires_after CDATA #IMPLIED
last_updated CDATA #IMPLIED >
#-------------------------------------------------------------------------------
# Fetch section
# Defines the fetching of data via various means (http, ftp, email, local
# file, etc.). Not all elements are needed for all feed types.
#-------------------------------------------------------------------------------
<!ELEMENT fetch (fetch.location, fetch.authentication?)>
<!ELEMENT fetch.location EMPTY>
<!ELEMENT fetch.authentication EMPTY>
<!ATTLIST fetch # Determines how we get data
type (ftp|http|email|file) #REQUIRED >
<!ATTLIST fetch.authentication # If authentication is needed to pick up data
username CDATA #REQUIRED
password CDATA #REQUIRED >
<!ATTLIST fetch.location # Where we get data from
hostname CDATA #REQUIRED
starting_dir CDATA #IMPLIED >
#-------------------------------------------------------------------------------
# Input section
# Instructions for feed input.
#-------------------------------------------------------------------------------
<!ELEMENT input (section*)>
<!ELEMENT section (content_id*, source*, actions?)>
<!ELEMENT source (content_id*, section*)>
<!ELEMENT actions (content_id*, (parser|filter|section)*)>
<!ELEMENT content_id EMPTY>
<!ATTLIST section
description CDATA #IMPLIED
new_data_item CDATA #IMPLIED
label CDATA #IMPLIED
use CDATA #IMPLIED >
<!ATTLIST source
type CDATA #REQUIRED
value CDATA #IMPLIED >
<!ATTLIST actions
label CDATA #IMPLIED
use CDATA #IMPLIED >
<!ATTLIST content_id
name CDATA #REQUIRED >
#-------------------------------------------------------------------------------
# Parsing
#-------------------------------------------------------------------------------
<!ELEMENT parser (content_id*, section*)>
<!ATTLIST parser
type CDATA #REQUIRED
input_field CDATA #IMPLIED
output_field CDATA #REQUIRED
max_loops CDATA #IMPLIED
args CDATA #IMPLIED >
#-------------------------------------------------------------------------------
# Filtering
#-------------------------------------------------------------------------------
<!ELEMENT filter (content_id*, section*)>
<!ATTLIST filter
type CDATA #REQUIRED
fields CDATA #REQUIRED
max_loops CDATA #IMPLIED
args CDATA #IMPLIED >
#-------------------------------------------------------------------------------
# Output section
# Defines how to output information
#-------------------------------------------------------------------------------
<!ELEMENT output (output.instance*)>
<!ELEMENT output.instance (output.map*)>
<!ELEMENT output.map EMPTY> # Mapping internal fields to output fields
<!ATTLIST output.instance
type (File|Network_articles|Express_articles) #IMPLIED # Handler type
args CDATA #IMPLIED > # Additional arguments for output instance
<!ATTLIST output.map
source CDATA #REQUIRED # Source of the data to output
to CDATA #REQUIRED > # Where to output data. Depends on the
# type of output instance that is doing the work
<feed version="0.1">
<info>
<info.name>Feature Story News: Global briefing</info.name>
<info.description>Daily email feed from FSN</info.description>
<info.author>Mattias Bottner</info.author>
<info.created>20000714</info.created>
</info>
<scheduling>
<scheduling.category name="feature_story_news.global.ft" expires_after="1 week" executes="auto" />
</scheduling>
<fetch type="email">
<fetch.location hostname="email://localhost" starting_dir="" />
</fetch>
<input>
<section description="Parse email to extract articles">
<content_id name="feature_story_news.global.ft" />
<source type="field" field="INPUT" value="" />
<actions>
<filter type="text->substitute" fields="INPUT" args="old=^.*?USA\*\*\*\n*; new=" />
<filter type="text->substitute" fields="INPUT" args="old=\*\*\*EUROPE\*\*\*\n*; new=" />
<filter type="text->substitute" fields="INPUT" args="old=\*\*\*RUSSIA\*\*\*\n*; new=" />
<parser type="regex" max_loops="20" input_field="INPUT" output_field="article" args="regex=(.*?\n{2,});destructive=yes">
<section description="Extract the fields from each article">
<source type="field" field="article" value=""/>
<actions start_article="yes">
<parser type="regex" max_loops="1" input_field="article" output_field="title" args="regex=^([^\n]+)\n.*;destructive=no" />
<parser type="regex" max_loops="1" input_field="article" output_field="body" args="regex=^[^\n]+\n(.*);destructive=no" />
<filter type="text->trim_whitespace" fields="ALL" />
</actions>
</section>
</parser>
</actions>
</section>
</input>
<output>
<output.instance type="network_feed" />
</output>
</feed>
<feed version="0.1">
<info>
<info.name>Copley news</info.name>
<info.description>Daily feed from copley</info.description>
<info.author>Mattias Bottner</info.author>
<info.created>20000714</info.created>
</info>
<scheduling>
<scheduling.category name="copley.arts.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.auto_world.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.your_business.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.career.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.classical.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.alert_consumer.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.hollywood.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.everybody.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.film_close.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.films_focus.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.financial.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.fitness_forum.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.fyi_travel.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.george.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.harry.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.heart.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.jackkemp.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.lifewire.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.money.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.movie_reviews.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.world_music.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.net_sitings.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.neilr.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.nutrition.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.news.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.inside_people.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.pop_talk.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.bookshelf.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.social_security.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.adventure.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.travel_history.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.her_world.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.tv_close.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
<scheduling.category name="copley.video_reviews.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
</scheduling>
<fetch type="http">
<fetch.authentication username="" password=""/>
<fetch.location hostname="http://www.copleynews.com" starting_dir="/isyndicate" />
</fetch>
<input>
<section description="Get index page">
<source type="url" field="" value="Downloads" />
<actions>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(ArtsLeis\d{6}\.htm)".*?</A>##s"> <content_id name="copley.arts.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Auto\d{6}\.htm)".*?</A>##s"> <content_id name="copley.auto_world.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Business\d{6}\.htm)".*?</A>##s"> <content_id name="copley.your_business.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Career\d{6}\.htm)".*?</A>##s"> <content_id name="copley.career.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Classic\d{6}\.htm)".*?</A>##s"> <content_id name="copley.classical.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Consumer\d{6}\.htm)".*?</A>##s"> <content_id name="copley.alert_consumer.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Etc\d{6}\.htm)".*?</A>##s"> <content_id name="copley.hollywood.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(EB\d{6}\.tif)".*?</A>##s"> <content_id name="copley.everybody.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(FilmClose\d{6}\.htm)".*?</A>##s"> <content_id name="copley.film_close.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(FilmFocus\d{6}\.htm)".*?</A>##s"> <content_id name="copley.films_focus.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Financial\d{6}-.\.htm)".*?</A>##s"> <content_id name="copley.financial.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(FitForum\d{6}\.htm)".*?</A>##s"> <content_id name="copley.fitness_forum.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(FYITravel\d{6}\.htm)".*?</A>##s"> <content_id name="copley.fyi_travel.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(George\d{6}\.htm)".*?</A>##s"> <content_id name="copley.george.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Harry\d{6}\.htm)".*?</A>##s"> <content_id name="copley.harry.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Heart\d{6}\.htm)".*?</A>##s"> <content_id name="copley.heart.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Kemp\d{6}\.htm)".*?</A>##s"> <content_id name="copley.jackkemp.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Lifewire\d{6}\.htm)".*?</A>##s"> <content_id name="copley.lifewire.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Money\d{6}\.htm)".*?</A>##s"> <content_id name="copley.money.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(MR\d{6}.*?\.htm)".*?</A>##s"> <content_id name="copley.movie_reviews.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Music\d{6}\.htm)".*?</A>##s"> <content_id name="copley.world_music.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Net\d{6}\.htm)".*?</A>##s"> <content_id name="copley.net_sitings.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Relations\d{6}\.htm)".*?</A>##s"> <content_id name="copley.neilr.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Nutrition\d{6}\.htm)".*?</A>##s"> <content_id name="copley.nutrition.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?()".*?</A>##s"> <content_id name="copley..ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(People\d{6}\.htm)".*?</A>##s"> <content_id name="copley.inside_people.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(PopTalk\d{6}\.htm)".*?</A>##s"> <content_id name="copley.pop_talk.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Scan\d{6}\.htm)".*?</A>##s"> <content_id name="copley.bookshelf.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Ssocial\d{6}\.htm)".*?</A>##s"> <content_id name="copley.social_security.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(TravAdv\d{6}-.\.htm)".*?</A>##s"> <content_id name="copley.adventure.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(TravHist\d{6}\.htm)".*?</A>##s"> <content_id name="copley.travel_history.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(HerWorld\d{6}\.htm)".*?</A>##s"> <content_id name="copley.her_world.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(TVClose\d{6}\.htm)".*?</A>##s"> <content_id name="copley.tv_close.ft" /> </parser>
<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Video\d{6}\.htm)".*?</A>##s"> <content_id name="copley.video_reviews.ft" /> </parser>
<section description="Parse each article">
<source type="url" field="art_url" value="" />
<actions start_article="yes">
<parser type="Regex" output_field="title" args="regex=s#<headline1?>(.*?)</?headline1?>##si"/>
<parser type="Regex" output_field="body" args="regex=s#<bodytext1?>(.*?)</?bodytext1?>##si"/>
<parser type="Regex" output_field="byline" args="regex=s#<byline1?>(.*?)</?byline1?>##si"/>
<filter type="text->trim_whitespace" fields="ALL" />
</actions>
</section>
</actions>
</section>
</input>
</feed>
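The two-stage extraction in this sample (find an article link on the index page, then pull fields out of the article) can be tried outside the feed system. The following is a minimal Python sketch, not the feed engine itself. It assumes that the Regex parser stores the first capture group in output_field, and that the trailing s and si flags carry their usual Perl meaning (dot matches newline, case-insensitive). The HTML snippets and the helper names first_capture, extract_art_urls and parse_article are hypothetical and exist only for illustration.

```python
import re

# Hypothetical index page; the real Copley page layout is not shown here.
INDEX_HTML = '''
<A HREF="/isyndicate/Downloads/ArtsLeis000612.htm">Arts and Leisure</A>
<A HREF="/isyndicate/Downloads/Auto000612.htm">Auto World</A>
'''

# Patterns copied from two <parser> elements of the sample, keyed by content_id.
INDEX_PATTERNS = {
    "copley.arts.ft": r'<A HREF="[^\"]*?(ArtsLeis\d{6}\.htm)".*?</A>',
    "copley.auto_world.ft": r'<A HREF="[^\"]*?(Auto\d{6}\.htm)".*?</A>',
}

# Hypothetical article markup using the tags the second section looks for.
ARTICLE_HTML = "<HEADLINE>Sample title</HEADLINE>\n<BODYTEXT>  Sample body.\n</BODYTEXT>"


def first_capture(pattern, text, flags=re.DOTALL):
    """Return capture group 1 of the first match, or None if nothing matches.

    Assumption: this mirrors how the Regex parser fills output_field.
    """
    match = re.search(pattern, text, flags)
    return match.group(1) if match else None


def extract_art_urls(html):
    """Stage 1: map each content_id to the article file name found on the index page."""
    return {cid: first_capture(pat, html) for cid, pat in INDEX_PATTERNS.items()}


def parse_article(html):
    """Stage 2: extract title, body and byline, then trim whitespace on all fields."""
    flags = re.DOTALL | re.IGNORECASE  # Perl's "si" modifiers
    fields = {
        "title": first_capture(r'<headline1?>(.*?)</?headline1?>', html, flags),
        "body": first_capture(r'<bodytext1?>(.*?)</?bodytext1?>', html, flags),
        "byline": first_capture(r'<byline1?>(.*?)</?byline1?>', html, flags),
    }
    # Equivalent in spirit to the trim_whitespace filter with fields="ALL".
    return {name: value.strip() if value is not None else None
            for name, value in fields.items()}


if __name__ == "__main__":
    print(extract_art_urls(INDEX_HTML))
    print(parse_article(ARTICLE_HTML))
```

Running a pattern against a saved copy of the real index page in this way is a quick check that each regex captures exactly one file name before the configuration is uploaded.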