Unified Scrape Configuration Reference


Index

Periodic Table of Elements
Element Reference
    Main section
        <feed>
    General feed information
        <info>
        <info.name>
        <info.description>
        <info.author>
        <info.created>
    Scheduling section
        <scheduling>
        <schedule>
        <hours>
        <minutes>
        <days>
        <months>
        <weekdays>
        <scheduling.content_id>
    Fetch section
        <fetch>
        <fetch.location>
        <fetch.authentication>
    Input section
        <input>
        <section>
        <source>
        <actions>
        <content_id>
        <parser>
        <filter>
    Output
        <output>
        <output.instance>
        <output.map>
Parsers and Filters
    Parsers
        Regex Parser
        XML Parser
    Filters
        Article_uniqueness
        Paragraph_breaks
        Regex
        Remove_font_tags
        Remove_tags
        Trim_whitespace
Testing and Uploading Configuration Files
DTD
Sample Configuration Files


Periodic Table of Elements

Element | Contents | Attributes | Allowed In
-----------------------------------------------------------------
<feed> | info, scheduling, fetch, input, output | version (required) | n/a
<info> | info.name (required), info.description, info.author, info.created, info.categories (required) | none | feed
<info.name> | Plain text | n/a | info
<info.description> | Plain text | n/a | info
<info.author> | Plain text | n/a | info
<info.created> | Plain text | n/a | info
<scheduling> | schedule (1+), scheduling.content_id (1+) | n/a | feed
<schedule> | hours (multiple allowed), minutes (multiple allowed), days (multiple allowed), months (multiple allowed), weekdays (multiple allowed) | label (required) | scheduling
<hours> | Plain text | n/a | schedule
<minutes> | Plain text | n/a | schedule
<days> | Plain text | n/a | schedule
<months> | Plain text | n/a | schedule
<weekdays> | Plain text | n/a | schedule
<scheduling.content_id> | n/a | name (required), use_schedule (required), expires_after, last_updated | scheduling
<fetch> | fetch.location (required), fetch.authentication | type | feed
<fetch.location> | n/a | hostname, starting_dir | fetch
<fetch.authentication> | n/a | username, password | fetch
<input> | section (1+) | n/a | feed
<section> | content_id (1+), source (1+), actions (multiple allowed) | description, new_data_item, label, use | input, parser, source
<source> | content_id (1+), section (1+) | type (required), field, value | section
<actions> | content_id (1+), parser or filter or section (1+) | label, use | section
<content_id> | n/a | name (required) | section, source, actions, parser, filter
<parser> | content_id (1+), section (1+) | type (required), input_field, output_field (required), max_loops, args | actions
<filter> | content_id (1+), section (1+) | type (required), fields (required), max_loops, args | actions
<output> | output.instance (1+) | n/a | feed
<output.instance> | output.map (1+) | type, args | output
<output.map> | n/a | source (required), to (required) | output.instance

Element Reference

In the context of a DTD (Document Type Definition), an element's contents are the tags nested between its opening and closing tags. Attributes are values defined within the tag itself. An empty tag (one that has attributes but no contents) can be closed within the opening tag by adding a slash at the end, e.g., <tag value="example" />.

Main section

<feed>

Top-level element defining the entire feed configuration. Each child element is optional, but <input> and <output> are used only in conjunction with <fetch>: if <fetch> is not present, <input> and <output> are invalid.

Contents: <info>, <scheduling>, <fetch>, <input>, <output>

Attributes: version (required)

Usage:

<feed version="0.1">...</feed>
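
The nesting rule for <feed> can be sketched as a skeleton. This is an illustrative outline only, assembled from the elements listed in this reference; each ellipsis stands for the child elements described in the sections that follow.

```xml
<feed version="0.1">
  <info>...</info>
  <scheduling>...</scheduling>

  <!-- <input> and <output> are valid only because <fetch> is present -->
  <fetch type="http">...</fetch>
  <input>...</input>
  <output>...</output>
</feed>
```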

General feed information

These settings allow you to specify general information about the feed: name, description, author, etc.

<info>

Top-level tag containing general feed information.

Contents: <info.name> (required), <info.description>, <info.author>, <info.created>, <info.categories> (required)

Attributes: n/a

Allowed Inside: <feed>

Usage:

<info>...</info>

<info.name>

Specifies the name of the content provider.

Contents: plain text

Attributes: n/a

Allowed Inside: <info>

Usage:

<info.name>Content Provider Name</info.name>

<info.description>

Brief feed description.

Contents: plain text

Attributes: n/a

Allowed Inside: <info>

Usage:

<info.description>Feed from ContentProvider.com</info.description>

<info.author>

Author of the current configuration file.

Contents: plain text

Attributes: n/a

Allowed Inside: <info>

Usage:

<info.author>John P. Author</info.author>

<info.created>

Configuration file creation date. The only date format currently supported is YYYYMMDD.

Contents: plain text

Attributes: n/a

Allowed Inside: <info>

Usage:

<info.created>20000804</info.created>

Scheduling section

Contains information about the scheduling of the individual content_ids of the feed. This includes article expiration within a category, the execution schedule of the category, and when the category was last updated.

<scheduling>

Top-level tag containing scheduling information.

Contents: <schedule> (one or more), <scheduling.content_id> (one or more)

Attributes: n/a

Allowed Inside: <feed>

Usage:

<scheduling>...</scheduling>

<schedule>

Scheduling for automated scrape. Several schedules can be set up and saved with different names, in the event that different feeds need to be dealt with at different times.

The following is an example of a schedule block:

<schedule label="Thursdays">
  <hours>3</hours>
  <minutes>00</minutes>
  <days>*</days>
  <months>*</months>
  <weekdays>Thursday</weekdays>
</schedule>

Contents: <hours> (multiple allowed), <minutes> (multiple allowed), <days> (multiple allowed), <months> (multiple allowed), <weekdays> (multiple allowed)

Attributes: label (required)

Allowed Inside: <scheduling>

Usage:

<schedule label="daily">...</schedule>

<hours>

Hourly scheduling. Accepted values are 0-23. Wildcards (*) are allowed.

Contents: Plain text

Attributes: n/a

Allowed Inside: <schedule>

Usage:

<hours>3</hours>

<minutes>

Minute scheduling. Accepted values are 00-59. Wildcards (*) are allowed.

Contents: Plain text

Attributes: n/a

Allowed Inside: <schedule>

Usage:

<minutes>00</minutes>

<days>

Daily scheduling. Accepted values are 1-31. Wildcards (*) are allowed.

Contents: Plain text

Attributes: n/a

Allowed Inside: <schedule>

Usage:

<days>*</days>

<months>

Monthly scheduling. Accepted values are 1-12. Wildcards (*) are allowed.

Contents: Plain text

Attributes: n/a

Allowed Inside: <schedule>

Usage:

<months>*</months>

<weekdays>

Day-of-week scheduling. Accepted values are the day names Monday through Sunday. Wildcards (*) are allowed.

Contents: Plain text

Attributes: n/a

Allowed Inside: <schedule>

Usage:

<weekdays>Tuesday</weekdays>

<scheduling.content_id>

Contains information about the feed, including the feed ID and expiration time. The use_schedule value should correspond to the label on the appropriate <schedule> block.

Contents: none

Attributes: name (required), use_schedule (required), expires_after, last_updated

Allowed Inside: <scheduling>

Usage:

<scheduling.content_id name="some_feed.ft" use_schedule="weekdays" />
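
The link between a schedule label and use_schedule can be sketched as follows. The label weekly_thursday and the feed ID some_feed.ft are placeholders chosen for illustration, and the "1 week" value for expires_after is borrowed from the sample files at the end of this reference rather than from a formal definition of the format.

```xml
<scheduling>
  <!-- Run at 03:00 every Thursday -->
  <schedule label="weekly_thursday">
    <hours>3</hours>
    <minutes>00</minutes>
    <days>*</days>
    <months>*</months>
    <weekdays>Thursday</weekdays>
  </schedule>

  <!-- use_schedule names the label of the <schedule> block above -->
  <scheduling.content_id name="some_feed.ft" use_schedule="weekly_thursday" expires_after="1 week" />
</scheduling>
```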

Fetch section

Defines the fetching of data via various means (http, ftp, email, local file, etc.). Not all elements are required for all feed types.

<fetch>

Top-level tag containing fetch information.

Contents: <fetch.location> (required), <fetch.authentication>

Attributes: type

Allowed Inside: <feed>

Usage:

<fetch type="http">...</fetch>

<fetch.location>

Specifies the location from which feed data will be scraped.

Contents: none

Attributes: hostname, starting_dir

Allowed Inside: <fetch>

Usage:

<fetch.location hostname="http://www.contentprovider.com" starting_dir="/syndication/" />

<fetch.authentication>

Username and password information, if authentication is required.

Contents: none

Attributes: username, password

Allowed Inside: <fetch>

Usage:

<fetch.authentication username="username" password="password" />
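
Put together, a complete fetch block might look like the sketch below. The host, directory, and credentials are placeholders; the attribute name hostname and the element order follow the DTD later in this reference.

```xml
<fetch type="http">
  <fetch.location hostname="http://www.contentprovider.com" starting_dir="/syndication/" />
  <!-- Optional: include only if the source requires a login -->
  <fetch.authentication username="username" password="password" />
</fetch>
```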

Input section

Processing for feed input. These elements allow you to define blocks of information and perform various operations on them.

The following is an example of an input block:

<input>
  <section description="get the xml file">
    <content_id name="nytimes.books.hl" />
    <source type="url" value="wireless_books.xml" />
    <actions label="do for each content_id">
      <parser type="xml" output_field="article" args="tags=Article">
        <section new_data_item="article">
          <actions>
            <parser type="xml" output_field="headline" args="tags=Headline"/>
            <parser type="xml" output_field="url" args="tags=URL"/>
          </actions>
        </section>
      </parser>
    </actions>
  </section>

  <section>
    <content_id name="nytimes.business.hl"/>
    <source type="url" value="wireless_business.xml"/>
    <actions use="do for each content_id"/>
  </section>
</input>

<input>

Top-level element containing input information.

Contents: <section> (multiple permitted)

Attributes: n/a

Allowed Inside: <feed>

Usage:

<input>...</input>

<section>

Enables logical grouping of sub-tags. The 'label' attribute allows the section to be marked for re-use later in the configuration file. This is useful if the same operations must be performed on a number of different feeds from the same source.

Contents: <content_id> (multiple allowed), <source> (multiple allowed), <actions>

Attributes: description, new_data_item, label, use

Allowed Inside: <input>, <parser>, <source>

Usage:

<section description="Get index page">...</section>

<source>

Specifies the type of feed source (url or field).

Contents: <content_id> (multiple allowed), <section> (multiple allowed)

Attributes: type (required), field, value

Allowed Inside: <section>

Usage:

<source type="field" field="title" value="Downloads" />

<actions>

Groups parsing and filter actions together into a logical block, which can be labeled for re-use with the label attribute. The use attribute instructs the engine to use the specified block.

The example below illustrates how to re-use an actions block:

<stuff>
    <actions label="do_useful_stuff">
        <parser.../>
        <filter.../>
    </actions>
</stuff>

<more_stuff>
    <actions use="do_useful_stuff" />
</more_stuff>

Contents: <content_id> (multiple allowed), <parser> or <filter> or <section> (multiple allowed)

Attributes: label, use

Allowed Inside: <section>

Usage:

<actions start_article="yes">...</actions>

<content_id>

Specifies the feed ID.

Contents: none

Attributes: name (required)

Allowed Inside: <section>, <source>, <actions>, <parser>, <filter>

Usage:

<content_id name="feature_story_news.global.ft" />

<parser>

There are currently two kinds of parsers: an XML parser and a regular expression parser. These parsers extract information from content and return the extracted information.

Contents: <content_id> (multiple allowed), <section> (multiple allowed)

Attributes: type (required), input_field, output_field (required), max_loops, args

Allowed Inside: <actions>

Usage:

<parser type="regex" max_loops="20" input_field="INPUT" output_field="article" args="regex=(.*?\n{2,});destructive=yes">...</parser>

<filter>

Content filtering utilities. Only one filter can be used per element; to use multiple filters, specify each one in its own element. Currently only text filtering utilities have been implemented. Future implementations will include filters for images, HTML, and other assorted types. To specify the kind of filter being used (text, image, etc.), prefix the filter name with it, e.g., type="text->paragraph_breaks".

Contents: <content_id> (multiple allowed), <section> (multiple allowed)

Attributes: type (required), fields (required), max_loops, args

Allowed Inside: <actions>

Usage:

<filter type="Text->Remove_tags" fields="body" args="modifiers=gi" max_loops="10">...</filter>
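
Because each element carries a single filter, a chain of filters is written as consecutive elements inside an <actions> block. The sketch below assumes that filters run in document order and that a field named body exists; both are assumptions made for illustration. The filter types are those listed under Parsers and Filters.

```xml
<actions>
  <!-- One filter per element -->
  <filter type="text->remove_font_tags" fields="body" />
  <filter type="text->remove_tags" fields="body" args="tags=td" />
  <filter type="text->trim_whitespace" fields="body" />
</actions>
```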

Output

Defines how to output information.

<output>

Top-level tag containing output information.

Contents: <output.instance> (one or more)

Attributes: n/a

Allowed Inside: <feed>

Usage:

<output>...</output>

<output.instance>

Specifies the output type and contains tags specifying any additional output information.

Contents: <output.map> (one or more)

Attributes: type, args

Allowed Inside: <output>

Usage:

<output.instance type="Network_articles">...</output.instance>

<output.map>

Maps internal fields to output fields.

Contents: none

Attributes: source (required), to (required)

Allowed Inside: <output.instance>

Usage:

<output.map source="article->title" to="title" />
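
A complete output block might be sketched as follows. The second mapping (article->body to body) is hypothetical and only shows that an instance can hold several maps; the valid values for "to" depend on the output instance type.

```xml
<output>
  <output.instance type="Network_articles">
    <!-- source: internal field; to: field understood by the output handler -->
    <output.map source="article->title" to="title" />
    <output.map source="article->body" to="body" />
  </output.instance>
</output>
```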

Parsers and Filters

Parsers

Regex

The regex parser applies a regular expression, either a match or a substitution, to its input and returns the results. Important: characters reserved in XML (such as <, >, &, and ") must be replaced with the corresponding XML character entity (&lt;, &gt;, &amp;, and &quot;).

Parser Arguments:

Usage:

<parser type="regex" max_loops="20" input_field="INPUT" output_field="article" args="regex=(.*?\n{2,});destructive=yes">...</parser>
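
As an illustration of the escaping rule, a pattern that matches an HTML anchor contains <, >, and ", all of which have to be written as entities inside the args attribute. The pattern and the output field name art_url are adapted from the sample files at the end of this reference and are illustrative only.

```xml
<!-- Intended pattern: s#<A HREF="[^"]*?(Auto\d{6}\.htm)".*?</A>##s -->
<parser type="regex" output_field="art_url"
        args="regex=s#&lt;A HREF=&quot;[^&quot;]*?(Auto\d{6}\.htm)&quot;.*?&lt;/A&gt;##s" />
```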

XML

The XML parser searches the input for the specified XML tags and, optionally, for attributes within those tags.

Parser Arguments:

Usage:

<parser type="xml" output_field="feed_name" args="tags=feed;attribute=name" />

Filters

Article_uniqueness

Ensures articles are unique based on specified criteria.

Filter Arguments:

Usage:

<filter type="Text->Article_uniqueness" fields="title" />

Paragraph_breaks

Fixes paragraph breaks in text; \n\n gets replaced with <P>.

Filter Arguments:

Usage:

<filter type="Text->Paragraph_breaks" fields="body" args="break=[\r\n]{3,};modifiers=gi" />

Regex

Regular expression based filtering.

Filter Arguments:

Usage:

<filter type="Text->Regex" fields="body" args="regex=s#[\n\r]{2,}#\n&lt;P&gt;\n#sgi" />

Remove_font_tags

Removes <FONT...>, </FONT> tags from text passed to it.

Filter arguments: n/a

Usage:

<filter type="text->remove_font_tags" fields="all" />

Remove_tags

Removes tags from text.

Filter arguments:

Usage:

<filter type="text->remove_tags" fields="all" args="tags=td" />

Trim_whitespace

Removes excess whitespace from text.

Filter arguments: n/a

Usage:

<filter type="text->trim_whitespace" fields="all" args="modifiers=gi" />

Testing and Uploading Configuration Files

All commands to test and upload configuration files must be issued as the dbadmin user on scrapestage, so you must have the proper sudo privileges. If you do not have the correct privileges, see Mattias.

This section is necessarily incomplete, because this environment will change with time.

Testing a config file:

Testing a configuration file currently consists of running the feed with as much error info turned on as possible. For information on how to use the scrape command, just type scrape at the command prompt:

[mbottner@alamo ~]$ scrape
Usage : scrape [-v[v[v]]|-q|-h] file
    file  name of the configuration file for the feed you want to run
    -h    this message
    -q    run the feed quietly (i.e. only fatal errors are reported)
    -v    determines how much information the feed should return.
          For more info, use more 'v's (i.e. -vv or -vvv)

To test a configuration file, make your telnet window at least 120 columns wide and type the following command:

[mbottner@scrapestage network]$ sudo -u dbadmin scrape -vvv config_file

Note that the command is being run as dbadmin by virtue of the sudo -u dbadmin command.

Much information will spew forth from the scrape engine onto your screen, and you will be overwhelmed. If this is the first time you're testing the configuration file, you will probably encounter an error message, as in the following example:

[mbottner@scrapestage network]$ sudo -u dbadmin scrape -vvv config_file
Using built in validity checks
Config file error: Unrecognized attribute 'host' in tag 'fetch.location':
   |
   |  <fetch type="http">
   |    <fetch.location host="http://www.channel2.co.uk" starting_dir="/xml" />
   |====================^
   |  </fetch>

This means there is an error in the configuration file that the engine could catch, and tells you where the error is. Occasionally the error message will point you to the wrong place, which can be frustrating, but this should be rare. Once you have corrected the error, rerun the feed. If there are more errors, keep fixing them until it works. Eventually you will see output that looks something like this:

[mbottner@scrapestage network]$ sudo -u dbadmin scrape -vvv bridge_news
Using built in validity checks
Main scrape engine started             |                                       |
                                       |                                       |
Initializing fetch system based on     |                                       |
parameters from config file            |                                       |
                                       |                                       |
Fetch system configuration:            |                                       |
                                       |                                       |
  Type        : http                   | <fetch type="http">                   |
                                       |                                       |
  Hostname    : http://www.bridge.com  | <fetch.location hostname="http://www. |
  StartingDir : bio                    | bridge.com" starting_dir="bio"/>      |
                                       |                                       |
Starting input processing...           | <input>                               |
                                       |                                       |
Executing section                      | <section description="Get the         |
                                       | articles out of the index page">      |
                                       |                                       |
Executing source                       | <source type="url" value="biostory.   |
                                       | asp?story=8882&fixed=no&              |
                                       | newwindow=false"/>                    |
                                       |                                       |
Fetching data from url (http://www.    |                                       |
bridge.com/bio/biostory.asp?story=8882 |                                       |
&fixed=no&newwindow=false)             |                                       |
....

All of this output is generated as the engine runs the feed. It provides information about what is actually going on behind the scenes, with the commands being executed in the left column, and the corresponding XML from the configuration file in the center column. This information will be more detailed in future releases.

Note: There are currently many issues with debugging. The output section of the feeds does not work on scrapestage due to several missing tables in the database. Mattias will be working with ops to synchronize the scrapestage database on a regular basis.

Once the configuration file is running properly, it is ready to be uploaded to the production system.

Uploading a configuration file to the production machine:

Uploading is fairly simple. Using the command push_feed, the specified configuration file is copied from the staging machine (scrapestage) to the production machine (alamo) and inserted into the crontab. If the upload fails, a message will alert you and specify the reason for the failure. If the error message is unclear, contact Mattias for help.

For help on how to use the push_feed command, type the name of the command:

[mbottner@scrapestage express]$ push_feed
Usage : /usr/local/bin/push_feed [-v[v[v]]|-q|-h] file
    file  name of the configuration file you want to push live
    -h    this message
    -q    push the feed quietly (i.e. only fatal errors are reported)
    -v    determines how much information the push script should return.
          For more info, use more 'v's (e.g. -vv or -vvv)

Please note that when pushing a feed into production you must specify the feed name relative to the config directory. For example:

[mbottner@scrapestage network]$ sudo -u dbadmin push_feed -vvv network/test
File '/usr/local/ft/config/network/test' successfully transferred to alamo

Even if you are already in the directory /usr/local/ft/config/network, you must specify push_feed -vvv network/test. In future releases this is likely to change to behave more like other commands.

In this example, the push was successful, meaning that the configuration file was copied over to alamo and installed in the crontab.


DTD

This is a preliminary DTD for feed configuration files. Once the system is set
up, the DTD should be able to change depending on available options for filters
and modules in the system. For example, if a new text filter is added to the
system, the DTD would reflect its availability as a filtering option.

If the module hierarchy is set up properly, it should just be a matter of
dropping a module into the proper directory to make it available (assuming that
it has implemented its interfaces correctly). The system should be able to see
it by querying all the modules in the directory.

<!-- ========================================================================
  Main section
======================================================================== -->

<!ELEMENT feed (info?, scheduling?, (fetch, input?, output?)?) >

<!ATTLIST feed version
          CDATA #REQUIRED >


<!-- ========================================================================
  General feed information:
    Section of DTD which contains general information about the feed: name,
    description, author, etc.
======================================================================== -->

<!ELEMENT info (info.name, info.description?, info.author?, info.created?,
                   info.categories)>
<!ELEMENT info.name (#PCDATA)>
<!ELEMENT info.description (#PCDATA)>
<!ELEMENT info.author (#PCDATA)>
<!ELEMENT info.created (#PCDATA)>


<!-- ========================================================================
  Scheduling section
    Contains information about the scheduling of the individual content_ids of
    the feed. Includes info about article expiration within a category, execution
    schedule of the category and when the category was last updated.
======================================================================== -->

<!ELEMENT scheduling (schedule*, scheduling.content_id*)>
<!ELEMENT schedule (hours?, minutes?, days?, months?, weekdays?)>
<!ELEMENT scheduling.content_id EMPTY>
<!ELEMENT hours (#PCDATA)>
<!ELEMENT minutes (#PCDATA)>
<!ELEMENT days (#PCDATA)>
<!ELEMENT months (#PCDATA)>
<!ELEMENT weekdays (#PCDATA)>

<!ATTLIST schedule
          label         CDATA #REQUIRED >
<!ATTLIST scheduling.content_id
          name          CDATA #REQUIRED
          use_schedule  CDATA #REQUIRED
          expires_after CDATA #IMPLIED
          last_updated  CDATA #IMPLIED >


<!-- ========================================================================
  Fetch section
    Defines the fetching of data via various means (http, ftp, email, local
    file, etc.). Not all elements are needed for all feed types.
======================================================================== -->

<!ELEMENT fetch (fetch.location, fetch.authentication?)>
<!ELEMENT fetch.location EMPTY>
<!ELEMENT fetch.authentication EMPTY>

<!-- Determines how we get data -->
<!ATTLIST fetch
          type          (ftp|http|email|file) #REQUIRED >

<!-- If authentication is needed to pick up data -->
<!ATTLIST fetch.authentication
          username      CDATA #REQUIRED
          password      CDATA #REQUIRED >

<!-- Where we get data from -->
<!ATTLIST fetch.location
          hostname      CDATA #REQUIRED
          starting_dir  CDATA #IMPLIED >


<!-- ========================================================================
  Input section
    Instructions for feed input.
======================================================================== -->

<!ELEMENT input (section*)>
<!ELEMENT section (content_id*, source*, actions?)>
<!ELEMENT source (content_id*, section*)>
<!ELEMENT actions (content_id*, (parser|filter|section)*)>
<!ELEMENT content_id EMPTY>

<!ATTLIST section
          description   CDATA #IMPLIED
          new_data_item CDATA #IMPLIED
          label         CDATA #IMPLIED
          use           CDATA #IMPLIED >

<!ATTLIST source
          type          CDATA #REQUIRED
          field         CDATA #IMPLIED
          value         CDATA #IMPLIED >

<!ATTLIST actions
          label         CDATA #IMPLIED
          use           CDATA #IMPLIED >

<!ATTLIST content_id
          name          CDATA #REQUIRED >


<!-- ========================================================================
  Parsing
======================================================================== -->

<!ELEMENT parser (content_id*, section*)>

<!ATTLIST parser
          type          CDATA #REQUIRED
          input_field   CDATA #IMPLIED
          output_field  CDATA #REQUIRED
          max_loops     CDATA #IMPLIED
          args          CDATA #IMPLIED >

<!-- ========================================================================
  Filtering
======================================================================== -->

<!ELEMENT filter (content_id*, section*)>

<!ATTLIST filter
          type          CDATA #REQUIRED
          fields        CDATA #REQUIRED
          max_loops     CDATA #IMPLIED
          args          CDATA #IMPLIED >

<!-- ========================================================================
  Output section
    Defines how to output information
======================================================================== -->

<!ELEMENT output (output.instance*)>
<!ELEMENT output.instance (output.map*)>
<!-- Mapping internal fields to output fields -->
<!ELEMENT output.map EMPTY>

<!-- type: handler type; args: additional arguments for the output instance -->
<!ATTLIST output.instance
          type          (File|Network_articles|Express_articles) #IMPLIED
          args          CDATA #IMPLIED >

<!-- source: source of the data to output
     to:     where to output data; depends on the type of output instance
             that is doing the work -->
<!ATTLIST output.map
          source        CDATA #REQUIRED
          to            CDATA #REQUIRED >

Sample Configuration Files

<feed version="0.1">
  <info>
    <info.name>Feature Story News: Global briefing</info.name>
    <info.description>Daily email feed from FSN</info.description>
    <info.author>Mattias Bottner</info.author>
    <info.created>20000714</info.created>
  </info>

  <scheduling>
    <scheduling.category name="feature_story_news.global.ft"        expires_after="1 week" executes="auto" />
  </scheduling>

  <fetch type="email">
    <fetch.location hostname="email://localhost" starting_dir="" />
  </fetch>

  <input>
    <section description="Parse email to extract articles">
      <content_id name="feature_story_news.global.ft" />

      <source type="field" field="INPUT" value="" />

      <actions>
        <filter type="text->substitute" fields="INPUT" args="old=^.*?USA\*\*\*\n*; new=" />
        <filter type="text->substitute" fields="INPUT" args="old=\*\*\*EUROPE\*\*\*\n*; new=" />
        <filter type="text->substitute" fields="INPUT" args="old=\*\*\*RUSSIA\*\*\*\n*; new=" />

        <parser type="regex" max_loops="20" input_field="INPUT" output_field="article" args="regex=(.*?\n{2,});destructive=yes">
          <section description="Extract the fields from each article">
            <source type="field" field="article" value=""/>

            <actions start_article="yes">
              <parser type="regex" max_loops="1" input_field="article" output_field="title" args="regex=^([^\n]+)\n.*;destructive=no" />
              <parser type="regex" max_loops="1" input_field="article" output_field="body"  args="regex=^[^\n]+\n(.*);destructive=no" />

              <filter type="text->trim_whitespace" fields="ALL" />
            </actions>
          </section>
        </parser>

      </actions>
    </section>
  </input>

  <output>
    <output.instance type="network_feed" />
  </output>
</feed>

<feed version="0.1">
  <info>
    <info.name>Copley news</info.name>
    <info.description>Daily feed from copley</info.description>
    <info.author>Mattias Bottner</info.author>
    <info.created>20000714</info.created>
  </info>

  <scheduling>
    <scheduling.category name="copley.arts.ft"            expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.auto_world.ft"      expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.your_business.ft"   expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.career.ft"          expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.classical.ft"       expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.alert_consumer.ft"  expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.hollywood.ft"       expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.everybody.ft"       expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.film_close.ft"      expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.films_focus.ft"     expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.financial.ft"       expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.fitness_forum.ft"   expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.fyi_travel.ft"      expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.george.ft"          expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.harry.ft"           expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.heart.ft"           expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.jackkemp.ft"        expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.lifewire.ft"        expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.money.ft"           expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.movie_reviews.ft"   expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.world_music.ft"     expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.net_sitings.ft"     expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.neilr.ft"           expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.nutrition.ft"       expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.news.ft"            expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.inside_people.ft"   expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.pop_talk.ft"        expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.bookshelf.ft"       expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.social_security.ft" expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.adventure.ft"       expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.travel_history.ft"  expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.her_world.ft"       expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.tv_close.ft"        expires_after="1 week" executes="*/*, 12:00, Tuesday" />
    <scheduling.category name="copley.video_reviews.ft"   expires_after="1 week" executes="*/*, 12:00, Tuesday" />
  </scheduling>

  <fetch type="http">
    <fetch.authentication username="" password=""/>
    <fetch.location host="http://www.copleynews.com" starting_dir="/isyndicate" />
  </fetch>

  <input>
    <section description="Get index page">
      <source type="url" field="" value="Downloads" />

      <actions>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(ArtsLeis\d{6}\.htm)".*?</A>##s">   <content_id name="copley.arts.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Auto\d{6}\.htm)".*?</A>##s">       <content_id name="copley.auto_world.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Business\d{6}\.htm)".*?</A>##s">   <content_id name="copley.your_business.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Career\d{6}\.htm)".*?</A>##s">     <content_id name="copley.career.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Classic\d{6}\.htm)".*?</A>##s">    <content_id name="copley.classical.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Consumer\d{6}\.htm)".*?</A>##s">   <content_id name="copley.alert_consumer.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Etc\d{6}\.htm)".*?</A>##s">        <content_id name="copley.hollywood.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(EB\d{6}\.tif)".*?</A>##s">         <content_id name="copley.everybody.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(FilmClose\d{6}\.htm)".*?</A>##s">  <content_id name="copley.film_close.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(FilmFocus\d{6}\.htm)".*?</A>##s">  <content_id name="copley.films_focus.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Financial\d{6}-.\.htm)".*?</A>##s"><content_id name="copley.financial.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(FitForum\d{6}\.htm)".*?</A>##s">   <content_id name="copley.fitness_forum.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(FYITravel\d{6}\.htm)".*?</A>##s">  <content_id name="copley.fyi_travel.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(George\d{6}\.htm)".*?</A>##s">     <content_id name="copley.george.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Harry\d{6}\.htm)".*?</A>##s">      <content_id name="copley.harry.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Heart\d{6}\.htm)".*?</A>##s">      <content_id name="copley.heart.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Kemp\d{6}\.htm)".*?</A>##s">       <content_id name="copley.jackkemp.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Lifewire\d{6}\.htm)".*?</A>##s">   <content_id name="copley.lifewire.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Money\d{6}\.htm)".*?</A>##s">      <content_id name="copley.money.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(MR\d{6}.*?\.htm)".*?</A>##s">      <content_id name="copley.movie_reviews.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Music\d{6}\.htm)".*?</A>##s">      <content_id name="copley.world_music.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Net\d{6}\.htm)".*?</A>##s">        <content_id name="copley.net_sitings.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Relations\d{6}\.htm)".*?</A>##s">  <content_id name="copley.neilr.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Nutrition\d{6}\.htm)".*?</A>##s">  <content_id name="copley.nutrition.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?()".*?</A>##s">                     <content_id name="copley..ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(People\d{6}\.htm)".*?</A>##s">     <content_id name="copley.inside_people.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(PopTalk\d{6}\.htm)".*?</A>##s">    <content_id name="copley.pop_talk.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Scan\d{6}\.htm)".*?</A>##s">       <content_id name="copley.bookshelf.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Ssocial\d{6}\.htm)".*?</A>##s">    <content_id name="copley.social_security.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(TravAdv\d{6}-.\.htm)".*?</A>##s">  <content_id name="copley.adventure.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(TravHist\d{6}\.htm)".*?</A>##s">   <content_id name="copley.travel_history.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(HerWorld\d{6}\.htm)".*?</A>##s">   <content_id name="copley.her_world.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(TVClose\d{6}\.htm)".*?</A>##s">    <content_id name="copley.tv_close.ft" /> </parser>
	<parser type="Regex" output_field="art_url" args="regex=s#<A HREF="[^\"]*?(Video\d{6}\.htm)".*?</A>##s">      <content_id name="copley.video_reviews.ft" /> </parser>

	<section description="Parse each article">
          <source type="url" field="art_url" value="" />

          <actions start_article="yes">
	    <parser type="Regex" output_field="title"  args="regex=s#<headline1?>(.*?)</?headline1?>##si"/>
	    <parser type="Regex" output_field="body"   args="regex=s#<bodytext1?>(.*?)</?bodytext1?>##si"/>
	    <parser type="Regex" output_field="byline" args="regex=s#<byline1?>(.*?)</?byline1?>##si"/>

	    <filter type="text->trim_whitespace" fields="ALL" />
          </actions>
        </section>

      </actions>
    </section>
  </input>
</feed>
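
The index-page patterns in the sample above can be checked offline before a configuration is uploaded. The following is a minimal Python sketch, not part of the scrape tool itself. It assumes that the Regex parser stores the first capture group in its output_field, and that the trailing "s" flag lets "." match newlines, as it does in Perl. The HTML fragment, the file names in it, and the function name extract_art_urls are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical index-page fragment; the real page layout is not shown in this document.
INDEX_HTML = (
    '<A HREF="/isyndicate/Downloads/Auto000215.htm">Auto World</A>\n'
    '<A HREF="/isyndicate/Downloads/Money000215.htm">Money</A>\n'
)

# Patterns copied from the sample: the text between the first two '#' delimiters.
PATTERNS = {
    "copley.auto_world.ft": r'<A HREF="[^\"]*?(Auto\d{6}\.htm)".*?</A>',
    "copley.money.ft": r'<A HREF="[^\"]*?(Money\d{6}\.htm)".*?</A>',
}


def extract_art_urls(html):
    """Return {content_id: captured file name} for each pattern that matches."""
    found = {}
    for content_id, pattern in PATTERNS.items():
        # re.DOTALL stands in for the trailing 's' flag (assumed Perl semantics).
        match = re.search(pattern, html, re.DOTALL)
        if match:
            # Assumption: the first capture group is what lands in art_url.
            found[content_id] = match.group(1)
    return found


if __name__ == "__main__":
    for content_id, art_url in extract_art_urls(INDEX_HTML).items():
        print(content_id, "->", art_url)
```

A pattern that returns no entry for its content ID here would most likely leave art_url empty in the nested "Parse each article" section as well, so a check of this kind may catch a misspelled file prefix early.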