Newsgroups


DESCRIPTION:

A collection of a million records is being assembled. Mail messages or Newsgroup messages were possible candidates. We are using selected Newsgroup categories as the source for the collection.

PHYSICAL SOURCE:

A live newsgroup feed from SDSC's Newsgroup Server is being tapped to collect 1 million messages. A live feed was chosen over a newsgroup archive in order to learn the issues involved in archiving data feeds. An initial collection of 800,000 entries was generated and used to validate the ingestion procedures and the data object characterization. The collection was analyzed to verify that individual entries could be separately identified. Based on the results, the ingestion scripts were modified to automatically encapsulate each entry as a well-defined digital object during ingestion. This is one of the major lessons learned: message formats varied enough from the published standard that externally imposed tags had to be used to guarantee that each object could be differentiated within the data flow.
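The encapsulation idea can be illustrated with a minimal Python sketch. The report does not give the tag format used by the ingestion scripts (which were written in Perl), so the `<<OBJECT ...>>` wrapper, the `<<END>>` footer, and the function names below are hypothetical. The sketch only shows why an external tag carrying the byte length lets objects be separated without trusting the message format itself.

```python
import io

# Hypothetical wrapper format; the report does not specify the real tags.
HEADER_TEMPLATE = "<<OBJECT id={obj_id} length={length}>>\n"
FOOTER = b"<<END>>\n"


def encapsulate(obj_id: str, message: bytes) -> bytes:
    """Wrap one raw message in an external tag that records its byte length.

    Assumes obj_id contains no whitespace.
    """
    header = HEADER_TEMPLATE.format(obj_id=obj_id, length=len(message))
    return header.encode("ascii") + message + FOOTER


def split_objects(stream: bytes):
    """Yield (obj_id, message) pairs from a buffer of encapsulated objects."""
    buf = io.BytesIO(stream)
    while True:
        header = buf.readline()
        if not header:
            break  # end of buffer
        # Header looks like "<<OBJECT id=X length=N>>"; drop the brackets.
        parts = header.decode("ascii").strip()[2:-2].split()[1:]
        fields = dict(part.split("=", 1) for part in parts)
        # Read exactly `length` bytes, whatever the message contains.
        message = buf.read(int(fields["length"]))
        if buf.readline() != FOOTER:
            raise ValueError("corrupt object boundary")
        yield fields["id"], message
```

Because the boundary is found by length rather than by scanning the message, a posting that lacks a trailing newline or happens to contain the footer text is still recovered intact.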

Four principal newsgroup categories were selected: comp, humanities, sci, and soc. They contain, respectively, computer messages, humanities messages, general science messages, and social science messages.
 
Total newsgroup count monitored

Category      Newsgroups
Comp                 733
Humanities             7
Sci                  184
Soc                  246
Total              1,170

Over 1,000 individual newsgroups are represented. Examples of newsgroups in each category are:

As of January 7, 1999, the collection process has been running for 20 days. We have collected two-thirds of the million-record collection, representing an aggregate storage size of 1.7 GB. This corresponds to roughly a quarter of a million postings per week, with an average size of 2.7 KB. The collection generation rate is limited by the rate at which messages are posted to the newsgroups. The million-record collection is expected to be completed by January 20.

It appears that "humanities" and "social science" messages are on average a little longer (3.1 and 3.2 KB), which could indicate that scientists and engineers are a little more terse!
 
 
Category      # postings    Size        # postings/week    Avg. size of posting
Comp             349,609    749.0 MB            116,000    2.2 KB
Humanities         3,076      9.2 MB              1,000    3.1 KB
Sci               77,688    196.0 MB             26,000    2.6 KB
Soc              240,652    749.0 MB             80,000    3.2 KB
Total            671,025      1.7 GB            223,000    2.7 KB
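The per-category figures can be cross-checked with a few lines of arithmetic. The Python sketch below recomputes the totals and the average posting size from the postings and size columns. It assumes binary units (1 MB = 1024 KB), which the report does not state; under that assumption the per-category averages reproduce the reported values.

```python
# Figures copied from the table: (postings, size in MB, postings per week).
STATS = {
    "Comp": (349_609, 749.0, 116_000),
    "Humanities": (3_076, 9.2, 1_000),
    "Sci": (77_688, 196.0, 26_000),
    "Soc": (240_652, 749.0, 80_000),
}

KB_PER_MB = 1024  # assumption: binary units; not stated in the report


def average_kb(postings: int, size_mb: float) -> float:
    """Average posting size in KB, rounded to one decimal place."""
    return round(size_mb * KB_PER_MB / postings, 1)


total_postings = sum(p for p, _, _ in STATS.values())
total_per_week = sum(w for _, _, w in STATS.values())
averages = {name: average_kb(p, mb) for name, (p, mb, _) in STATS.items()}
```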

 

OBJECT LEVEL STRUCTURE / META-DATA:

To discover an individual message within the collection, meta-data is needed. Individual message objects follow the Network Working Group RFC-1036 standard, a standard for Interchange of Usenet Messages. RFC-1036 refers to RFC-822, a standard for the format of ARPA Internet text messages (August 13, 1982).

A standard USENET message consists of several header lines, followed by a blank line, followed by the body of the message. Each header line consists of a keyword, a colon, a blank, and some additional information. The Internet convention of continuation header lines (beginning with a blank or tab) is allowed.
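The layout just described (header lines, a blank line, then the body, with continuation lines folded into the preceding header) can be sketched in Python as follows. This is an illustration rather than the parser used for the collection; the function name is hypothetical, and it assumes lines end in a bare newline.

```python
def parse_message(text: str):
    """Split a message into (headers, body).

    headers is a list of (keyword, value) pairs in their original order.
    """
    # The first blank line separates the headers from the body.
    head, _, body = text.partition("\n\n")
    headers = []
    for line in head.split("\n"):
        if line[:1] in (" ", "\t") and headers:
            # Continuation line: fold it into the previous header's value.
            key, value = headers[-1]
            headers[-1] = (key, value + " " + line.strip())
        else:
            key, colon, value = line.partition(":")
            if not colon:
                raise ValueError(f"malformed header line: {line!r}")
            headers.append((key.strip(), value.strip()))
    return headers, body
```

A strict parser raises on the first malformed line, which is one way such a script could flag messages that deviate from the published format.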

Certain headers are required, and certain headers are optional. Any unrecognized headers are allowed, and will be passed through unchanged.

Required headers are:

From:
Date:
Newsgroups:
Subject:
Message-ID:
Path:

Optional headers are:

Followup-To:
Expires:
Reply-To:
Sender:
References:
Control:
Distribution:
Keywords:
Summary:
Approved:
Lines:
Xref:
Organization:
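A check against these two lists might look like the Python sketch below. The function name is hypothetical, and the case-insensitive comparison of header keywords is an assumption of the sketch rather than something the report specifies.

```python
REQUIRED = ("From", "Date", "Newsgroups", "Subject", "Message-ID", "Path")
OPTIONAL = (
    "Followup-To", "Expires", "Reply-To", "Sender", "References",
    "Control", "Distribution", "Keywords", "Summary", "Approved",
    "Lines", "Xref", "Organization",
)


def classify_headers(names):
    """Given the header keywords of one message (without the colon),
    return (missing required headers, unrecognized headers)."""
    present = {name.lower() for name in names}
    known = {name.lower() for name in REQUIRED + OPTIONAL}
    missing = [name for name in REQUIRED if name.lower() not in present]
    # Unrecognized headers are allowed; they are reported, not rejected.
    unrecognized = sorted(present - known)
    return missing, unrecognized
```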


We automated the generation of meta-data from the collection by looking for attributes common to all messages. A survey of the two-thirds of a million messages collected so far (done with a utility Perl script) found that the following header fields appear in every message:

From:
Date:
Lines:
Message-ID:
Newsgroups:
Path:
Subject:
Xref:


That is, although optional according to RFC-1036, Lines: and Xref: were always present. The analysis script took less than an hour to run on an SGI Indigo II. These attributes can be used to organize the data into a collection.
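The survey amounts to intersecting the sets of header keywords across all messages. The original utility was a Perl script that is not reproduced here; the Python sketch below shows the same idea, with hypothetical function names and the assumption that each message is available as a string with newline line endings.

```python
def header_names(message: str) -> set:
    """Return the set of header keywords in one message."""
    head = message.split("\n\n", 1)[0]  # headers end at the first blank line
    names = set()
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            continue  # continuation line, not a new header
        name, colon, _ = line.partition(":")
        if colon:
            names.add(name.strip())
    return names


def common_headers(messages) -> set:
    """Intersect the header keyword sets of all messages."""
    common = None
    for message in messages:
        names = header_names(message)
        common = names if common is None else common & names
    return common or set()
```

Only one running set is kept in memory, so the pass scales with the number of messages rather than with the number of distinct headers.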

A survey of the optional fields that appear in the messages collected so far (done with an additional "header-union" Perl script) turns up hundreds of spurious fields, including exotic user-created ones:

X-spam-hater:
X-WebTV-Signature:
X-Christmas:
X-Coffee:
Return-Path:
Status:
Originator:
Abuse-Reports-To:

These fields comprise semi-structured data that can be associated with the collection. The organization of semi-structured data is a research project at SDSC, with the goal of tagging semi-structured data with XML and then supporting queries against the XML tags.
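As a rough illustration of that goal, the Python sketch below tags header fields with XML and runs a simple query against the tags using the standard library's `xml.etree.ElementTree`. The element and attribute names (`collection`, `message`, `field`, `name`) and the function names are hypothetical and do not reflect the schema under investigation at SDSC. Header keywords are stored as attribute values rather than element names because user-created keywords are not guaranteed to be valid XML names.

```python
import xml.etree.ElementTree as ET


def message_to_xml(message_id, headers):
    """Build a <message> element from a list of (keyword, value) pairs."""
    msg = ET.Element("message", {"id": message_id})
    for name, value in headers:
        field = ET.SubElement(msg, "field", {"name": name})
        field.text = value
    return msg


def build_collection(messages):
    """messages: iterable of (message_id, headers) pairs."""
    root = ET.Element("collection")
    for message_id, headers in messages:
        root.append(message_to_xml(message_id, headers))
    return root


def messages_with_field(root, name):
    """Return the ids of messages carrying a given header keyword.

    Sketch only: assumes `name` contains no single quote.
    """
    query = f"field[@name='{name}']"
    return [m.get("id") for m in root.findall("message")
            if m.find(query) is not None]
```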

MISCELLANEOUS:

The collection process required checks to verify that all of the data stream was being archived. Several Perl scripts were written to monitor the arrival of new messages and perform sanity checks on the messages received. When first executed, the main script recursively descends through each newsgroup subdirectory (e.g., /misc/news/spool/comp, /misc/news/spool/humanities, /misc/news/spool/sci, /misc/news/spool/soc) and retrieves posted messages, concatenating them into a large buffer that will later be archived to HPSS. We currently use 4 buffers, one per newsgroup category.
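The recursive descent can be sketched in Python as follows. This is not the Perl script itself: the function name, the `.buf` file naming, and the rule that only files with all-digit names are articles are assumptions made for illustration, and the transfer to HPSS and the tagging of object boundaries are omitted.

```python
import os

CATEGORIES = ("comp", "humanities", "sci", "soc")


def collect(spool_root, buffer_dir):
    """Append every article under each category to that category's buffer.

    Returns a dict mapping category name to the number of articles copied.
    """
    os.makedirs(buffer_dir, exist_ok=True)
    counts = {}
    for category in CATEGORIES:
        count = 0
        buffer_path = os.path.join(buffer_dir, category + ".buf")
        with open(buffer_path, "ab") as buf:
            top = os.path.join(spool_root, category)
            # os.walk yields nothing if the category directory is absent.
            for dirpath, dirnames, filenames in os.walk(top):
                dirnames.sort()  # visit subdirectories in a stable order
                articles = sorted((n for n in filenames if n.isdigit()), key=int)
                for name in articles:
                    with open(os.path.join(dirpath, name), "rb") as article:
                        buf.write(article.read())
                    count += 1
        counts[category] = count
    return counts
```

The returned counts give a simple sanity check: they can be compared against the number of files seen in the spool to confirm that nothing in the stream was skipped.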

Postings are stored in individual newsgroup directories. For example, the messages posted to soc.history.war.vietnam are stored in /misc/news/spool/soc/history/war/vietnam, one file per message (for example, files named 1001 1002 1004 1005 1008 would indicate that 5 messages are currently available). Messages are typically purged from the News Server after 3 days, so our Perl script is executed every 2 1/2 days. The script keeps an internal database that logs which newsgroup names it visits and the current range of posted messages for each, so that later runs grab only the new messages that have arrived since the previous run.
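The differential pickup can be sketched as a per-newsgroup high-water mark. In the Python sketch below, the JSON state file stands in for the script's internal database, whose actual format the report does not describe; the function names are hypothetical, and the sketch assumes article numbers only increase within a newsgroup.

```python
import json
import os


def load_state(path):
    """Return {newsgroup name: highest article number archived so far}."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_state(path, state):
    with open(path, "w") as f:
        json.dump(state, f, indent=2, sort_keys=True)


def new_articles(spool_root, group, state):
    """List article numbers above the recorded mark and advance the mark."""
    # soc.history.war.vietnam -> <spool_root>/soc/history/war/vietnam
    group_dir = os.path.join(spool_root, *group.split("."))
    numbers = sorted(int(n) for n in os.listdir(group_dir) if n.isdigit())
    fresh = [n for n in numbers if n > state.get(group, 0)]
    if fresh:
        state[group] = fresh[-1]
    return fresh
```

With files 1001, 1002, and 1004 present on the first run and 1005 and 1008 added before the second, the first call returns the first three numbers and the second returns only the two new ones.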

This process results in an archive copy of the data stored in HPSS. The time needed to assemble the data, discover meta-data, and store the result in HPSS is roughly one hour per 200,000 messages. The time to ingest the digital objects into a database will be longer, and is under investigation.

Demonstrations are planned for February to illustrate the following capabilities: