Supporting Digital Scholarship 2001 Annual Report

From the Institute for Advanced Technology in the Humanities
and the Digital Library Research and Development Group
University of Virginia

Table of Contents

1. Summary

2. About SDS

3. Staff & Spending

3.1. Staff

3.2. Spending

3.3. Training, papers, consulting

4. FEDORA

4.1. Basic data object

4.2. BDefs and BMechs

4.3. Application services

4.4. Testing FEDORA

4.5. Current FEDORA efforts

5. WebCollector

5.1. About

5.2. Starting window

5.3. Modes

5.3.1. Other Fields

5.4. Sample

6. Granby

7. GDMS

7.1. GDMS in SDS

7.2. About the DTD

7.3. Future plans

8. GDMS toolkit

8.1. Menus

8.2. Working with nodes

8.3. Validation

8.4. Adding resource files

8.5. Future features

9. The Salisbury Project

9.1. About Salisbury

9.2. Converting EAD/SGML to GDMS/XML

9.3. Stylesheets

9.3.1. Table of Contents Construction

9.3.2. Navigation Aids

9.3.3. Content Synchronization

9.4. Saxon

9.5. Data objects

9.6. Disseminators

9.7. Moving into FEDORA

10. The Rossetti Project

10.1. About Rossetti

10.2. Editorial architecture

10.3. Data objects

10.4. Disseminators

10.5. SGML to XML

10.6. Stylesheets

10.7. Moving into FEDORA

11. The Pompeii Forum Project

11.1. Information structure

11.1.1. Analysis

11.1.2. Digital Resources

12. Policy committee report

12.1. Current issues

12.1.1. Control and collecting

12.1.2. Work identification and integrity

12.2. Other issues

12.2.1. Copyright

12.2.2. Authenticity

12.2.3. Persistence

12.2.4. Editions/versions

12.2.5. Bibliographic, administrative, and structural control

13. Summary of progress this year

14. Appendix

14.1. Content models

14.1.1. Salisbury

14.1.2. Rossetti

14.2. GDMS DTD

14.3. XSLT stylesheets

14.3.1. Salisbury: dynaxml.xsl

14.3.2. Salisbury: parameter.xsl

14.3.3. Salisbury: autotoc.xsl

14.3.4. Salisbury: structure.xsl

14.3.5. Rossetti: rossetti.xsl

 

1. Summary

As new tools and methods for born-digital scholarship continue to proliferate, authors continue to lack guidance in choosing and using them, and libraries continue to lack experience in collecting and preserving what authors produce with them. “Supporting Digital Scholarship” (SDS) is a three-year project intended to address this problem by exploring in detail technical and policy issues raised by library collections of born-digital humanities scholarship. The project, now entering its third year, is a joint effort by the University of Virginia’s Institute for Advanced Technology in the Humanities (IATH) and Digital Library Research and Development Group (DLRDG). It is funded by the Andrew W. Mellon Foundation.

Over the past two years, SDS has been investigating and developing tools that would help libraries collect and preserve digital research. The most important of these is the Flexible Extensible Digital Object Repository Architecture (FEDORA) originally developed at Cornell and subsequently refined and implemented at the University of Virginia. We have imported two test projects originally published with DynaWeb, The Salisbury Project and The Complete Writings and Pictures of Dante Gabriel Rossetti, into FEDORA. We were able to emulate DynaWeb’s functionality in FEDORA, replicate DynaWeb’s proprietary stylesheets in XSL, adjust SGML DTDs to work with XML, and migrate the projects’ original SGML data into XML. We are currently developing an authoring workspace that will give researchers an environment in which to create new digital projects that libraries can more easily collect and preserve. We are also developing policy guidelines that libraries (beginning with our own) can use to determine what level of collection and preservation can be offered, given the characteristics of a particular digital publication.

This report contains updates on FEDORA and other tools that SDS has been developing over the past two years (such as WebCollector, Granby, and the GDMS toolkit) and updates on the test projects themselves. It also includes the results, to date, from the Policy Committee. The Appendix contains more detailed information about the Salisbury and Rossetti data, stylesheets, and DTDs, and about the GDMS DTD.

 

2. About SDS

SDS has three fundamental areas of interest:

·        Scholarly use of digital primary resources

·        Library adoption of “born-digital” scholarly research

·        Co-creation of digital resources by scholars, publishers, and libraries

Our aim is to propose guidelines and document methods for creating, collecting, and preserving digital scholarship in the humanities, based on experiments with collecting real-world, born-digital scholarly projects into a working prototype of a real-world digital library system. 

There are complex problems in collecting and preserving born-digital scholarship. Research libraries have well-documented and long-tested methods for collecting traditional research in conventional media (e.g., books, photographs, and journals) but policies and procedures for the collection and preservation of digital media have, for the most part, yet to be established. Scholars with varying levels of technical skill and support are producing digital scholarship, and the results might require intensive intervention from would-be collectors. The digital objects in these works may be dependent on proprietary or unsupported software; critical functionality in the interface may be similarly restricted; digital resources may have unresolved copyright issues or have murky (or even unknown) provenance. The problems aren’t confined to the production end, since libraries are starting to acquire these collections but have no clear idea of how to manage them. Digital research may be the next great advance in humanities scholarship, but it requires digital libraries to establish standards and practices that promote long-term preservation of, and access to, that scholarship.

SDS is studying these issues with the aim of producing guidelines, tools, and methods for managing digital objects. A key part of its work has been using existing IATH projects as test cases to investigate technical and policy issues in the digital library system that is under development by the DLRDG.

 

3. Staff & Spending

3.1. Staff

In September 2001, SDS added one new staff member when Sarah Wells, a technical writer, joined the project. 

There are two working committees currently directing SDS activities, one for technical issues and one for policy issues. The Technical Committee is responsible for producing and implementing software and technical methods, and the Policy Committee is responsible for developing policy guidelines.

An Advisory Committee, made up of faculty and administrators who are institutional leaders and/or prominently engaged in the use of information technology in their own research and teaching, meets once a semester to review and respond to the work of the Policy and Technical Committees in the context of broader University perspectives, consider long-term issues in supporting digital scholarship, and disseminate information about this project to their colleagues.

The committee members are:

Technical Committee:

Rob Cordaro, Library (DLRDG) *

Kirk Hastings, IATH *

Chris Jessee, IATH *

Worthy Martin, IATH/Computer Science

Daniel Pitti, IATH

Steve Ramsay, IATH *

Perry Roland, Library (DLRDG)

Thorny Staples, Library (DLRDG), Co-Chair

John Unsworth, IATH/English, Co-Chair

Ross Wayland, Library (DLRDG)

 

*staff wholly or partly supported on SDS funding.

Policy Committee:

George Crafts, Library (Humanities Services)

John Dobbins, Art

Edward Gaynor, Special Collections

Sandy Kerbel, Science and Engineering Library

Phil McEldowney, Library (Social Sciences Services)

Daniel Pitti, IATH, Chair

Joan Ruelle, Science and Engineering Library

Thorny Staples, Library (DLRDG)


Advisory Committee:

Ed Ayers, Virginia Center for Digital History/History

Brian Balogh, History 

Johanna Drucker, Media Studies/English

David Germano, Religious Studies

Cheryl Mason, Curriculum, Instruction & Special Education

Kirk Martini, Architecture/Civil Engineering

Alan Howard, American Studies/English

Ben Ray, Religious Studies

Kathy Reed, Provost’s Office

Glen Robinson, Law School

Kathryn Rohe, Drama

Tim Sigmon, Information Technology and Communication

Kendon Stubbs, Library

Karin Wittenborg, Library

All three groups are invited to attend the public presentations given by outside experts who are brought to UVA to consult on the SDS project. In 2001, local events for these groups were:

·       March 2, 2001: visit from David Millman, manager of the Columbia University Academic Information Systems Research and Development group and a participant in the Open Archives Initiative.

·       April 11, 2001: public presentation by Carl Lagoze, Cornell University Digital Library Research Group.

 

3.2. Spending

We are still less than half-way through the project’s budget, though we are two-thirds of the way through the project’s schedule.  We anticipate that we will continue to work on the aims of the project well beyond the end of calendar year 2002, perhaps for as much as another year, and we will continue to report to the Andrew W. Mellon Foundation for as long as the SDS budget sustains that work.

SDS spending to date:

 

 

                  Expenditures          Expenditures          Grand total
                  calendar year 2000    calendar year 2001    to date

Salaries          $98,058.49            $225,741.17           $323,799.66
Consulting        1,589.62              5,421.13              7,010.75
Travel            6,856.12              17,633.00             24,489.12
Training          4,491.73              17,117.59             21,609.32
Research          4,750.89              31,188.82             35,939.71

Total spent:      $115,746.85           $297,101.71           $412,848.56

 

SDS Budget 1/00 - 12/02

                      Revenue       Salaries       Consulting    Travel        Training      Research       Balance
Grant Budget                        $631,000.00    $30,000.00    $75,000.00    $39,000.00    $225,000.00    $1,000,000.00
Expended 1/00-12/00   $41,554.51    -$98,058.49    -$1,589.62    -$6,856.12    -$4,491.73    -$4,750.89     -$74,192.34
Budget Balance 1/01   $41,554.51    $532,941.51    $28,410.38    $68,143.88    $34,508.27    $220,249.11    $925,807.66

2001 transactions (the budget category each amount was charged against is shown in the middle column):

JAN 01 Personnel                      Salaries      -$13,176.58
FB                                    Salaries       -$4,150.62
M. Roberts (TEI)                      Travel           -$995.00
M. Hedstrom                           Consulting     -$1,000.00
Cavalier Computers (K. Rinne)         Research       -$3,378.00
FEB 01 Personnel                      Salaries      -$13,178.58
FB                                    Salaries       -$4,150.62
Wages                                 Salaries         -$350.00
Catering                              Consulting       -$550.10
MAR 01 Personnel                      Salaries      -$13,567.58
FB                                    Salaries       -$4,273.79
Wages                                 Salaries       -$1,604.00
D. McShane (DLF)                      Travel           -$764.18
J. Ruelle (ACRL)                      Travel           -$371.00
W. Martin airfare                     Travel           -$274.50
C. Jessee airfare                     Travel           -$885.00
R. Wayland airfare                    Travel           -$255.00
J. Unsworth (lunch guest speaker)     Consulting        -$95.52
R. Cordaro (Java GUI)                 Training       -$1,995.00
APRIL 01 Personnel                    Salaries      -$13,567.58
FB                                    Salaries       -$4,273.79
Revenue April 01                      Revenue        $34,424.49
J. Ruelle (ACRL)                      Travel           -$389.96
R. Cordaro (JAVA)                     Training         -$852.00
A. Rutkowski (ACH)                    Travel           -$150.00
C. Jessee (SIGGRAPH)                  Training         -$575.00
R. Cordaro (Java GUI)                 Training       -$2,195.00
B. Scott                              Consulting       -$500.00
C. Lagoze                             Consulting        -$60.00
J. Allman                             Training       -$1,800.00
MAY 01 Personnel                      Salaries      -$13,567.58
Wages                                 Salaries         -$516.00
C. Lagoze                             Consulting       -$193.21
K. Hastings                           Training         -$350.00
Copy Center                           Research           -$5.25
C. Lagoze                             Consulting     -$1,000.00
C. Jessee                             Training         -$963.54
JUNE 01 Personnel                     Salaries       -$3,205.55
FB                                    Salaries         -$274.79
Wages                                 Salaries         -$720.00
R. Cordaro                            Travel         -$9,630.25
Catering                              Consulting       -$474.55
Supplies                              Training         -$690.00
JULY 01 Personnel                     Salaries      -$27,135.16
FB                                    Salaries       -$8,004.88
Wages                                 Salaries       -$4,183.55
FB                                    Salaries         -$292.85
P. Roland (XML 2001)                  Training         -$607.48
C. Jessee (Computing Arts 2001)       Training         -$128.61
Cavalier Computers                    Research         -$363.90
AUG 01 Personnel                      Salaries      -$13,567.58
FB                                    Salaries       -$4,002.44
Wages                                 Salaries       -$1,492.00
FB                                    Salaries         -$104.44
C. Jessee                             Travel         -$1,089.20
K. Hastings                           Travel         -$1,477.96
F. Johnson (Dreamweaver)              Training         -$395.25
SEPT 01 Personnel                     Salaries      -$13,567.58
FB                                    Salaries       -$4,001.44
Wages                                 Salaries         -$644.00
FB                                    Salaries          -$45.08
Business meals                        Consulting       -$375.57
Travel                                Training       -$5,240.48
C. Lagoze (Marriott Courtyard)        Consulting       -$102.93
P. Roland (XML 2001)                  Training       -$1,325.23
Laptop (S. Wells)                     Research       -$1,710.00
OCT 01 Personnel                      Salaries      -$13,567.58
FB                                    Salaries       -$4,002.44
Wages                                 Salaries         -$526.96
C. Jessee                             Travel           -$980.05
J. Unsworth                           Travel           -$370.90
NOV 01 Personnel                      Salaries      -$16,747.50
FB                                    Salaries       -$4,940.52
Wages                                 Salaries         -$204.44
Catering                              Consulting        -$43.62
Catering                              Consulting     -$1,025.63
Software AG                           Research      -$21,240.00
DEC 01 Personnel                      Salaries      -$13,567.58
FB                                    Salaries       -$4,002.44
Wages                                 Salaries         -$563.65
ISP                                   Research         -$330.00
C. Digital                            Research         -$289.35
Computer Supplies                     Research       -$3,833.32
Cavalier Computers                    Research          -$39.00

                      Revenue       Salaries       Consulting    Travel        Training      Research       Balance
Subtotal (2001)       $34,424.49    -$225,741.17   -$5,421.13    -$17,633.00   -$17,117.59   -$31,188.82
Totals                $75,979.00    $307,200.34    $22,989.25    $50,510.88    $17,390.68    $189,060.29    $663,130.44

 

3.3. Training, papers, consulting

Training

·        Robert Cordaro, Java Graphic User Interface Class, Sun Educational Services, CA. March 2001.

·        Chris Jessee, Moving Theory into Practice: Digital Imaging for Libraries and Archives Workshop, Cornell University Library Department of Preservation and Conservation, Ithaca, NY. May 2001.

·        Felicia Johnson, Introduction to Dreamweaver, Techead, Richmond, VA. August 2001.

·        Kirk Hastings and Perry Roland, Extreme Markup Languages 2001 Conference, Montreal, Canada. August 2001.

Travel and Papers delivered or proposed on matters relevant to SDS

·        Worthy Martin and Daniel McShane, Making of America II DTD Digital Library Federation Workshop, NYU, New York, NY. February 2001.

·        Joan Ruelle, Association of College and Research Libraries Conference, Denver, CO, March 2001.

·        Ross Wayland, Spring 2001 XML Connections Conference, New Orleans, LA. April 2001.

·        Steve Ramsay, “Planning Obsolescence: Dynamic Architectures for Digital Scholarship,” Joint International Meeting of the Association for Computers in the Humanities and the Association for Literary and Linguistic Computing, NYU, New York, NY. June 2001.

·        John Unsworth, “Publishing Originally Digital Scholarship at the University of Virginia,” ACH/ALLC Conference, NYU. June 2001.

·        Kirk Hastings, “The Salisbury Project: Collecting An Existing Digital Resource,” ACH/ALLC Conference, NYU, June 2001.

·        Kirk Hastings, “Progress of the Supporting Digital Scholarship Project,” panel discussion, ACH/ALLC Conference, NYU, June 2001.

·        Robert Cordaro, “Collecting HTML Scholarship,” ACH/ALLC Conference, NYU, June 2001.

·        Thornton Staples, Ken Price, Daniel Pitti and Alice Rutkowski, ACH/ALLC Conference, NYU, June 2001.

·        Robert Cordaro, ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA, June 2001

·        Chris Jessee, ACM SIGGRAPH 2001 Conference, Los Angeles, CA, August 2001.

·        Chris Jessee, “Flash GIS: Delivering Geographic Information on the Internet,” Computing Arts: Digital Resources for Research in the Humanities, Sydney, Australia. September 2001.

·        Robert Cordaro, XML 2001 Conference, Orlando, FL. December 2001.

Consultants

·        Jim Allman, Interrobang Digital Media, January 2001. Flash training.

·        Carl Lagoze, Cornell University Digital Library Research Group, April 2001. Public presentation about metadata research at Cornell.

·        Bradley Scott, Routledge Electronic Development Manager, April 2001. Discussion of FEDORA from a commercial publisher’s point of view.

·        David Millman, Columbia University Academic Information Systems Research and Development Group, March 2001. Discussion of digital library research at Columbia.

 

4. FEDORA

The SDS project uses a digital library architecture called FEDORA (Flexible Extensible Digital Object Repository Architecture) as the basis of its work. FEDORA was first developed at Cornell by Carl Lagoze and others, but the DLRDG’s interpretation of FEDORA is the largest implementation of the conceptual architecture to date and the first real-world installation. FEDORA is “plumbing” that can be used to construct a digital object repository or to build a complete digital library system. The UVA implementation of FEDORA is a repository of digital objects made up of one or more data streams (pointers to resources such as a text or image file, executable code, etc.), descriptive and administrative metadata, and links to one or more disseminators (data structures that pair behaviors with methods that implement the behaviors). The number of disseminators and data streams can vary depending on a project’s particular resources and requirements, but the basic arrangement is shown below in Figure 1.

Figure 1         FEDORA

The repository system delivers its objects by way of Java servlets that handle requests to FEDORA from the user. In this manner, it may provide resource files (files containing the actual information, such as images, sound clips, and text), code for implementing behaviors, and other service applications. Since object components such as data streams and behavior mechanisms are pointers, they can refer to files and programs that are local (on the same host as the repository) or remote (on a different host). For the Salisbury and Rossetti projects, all of the resources, code for implementing behaviors, and application services reside on the same host as the repository.

There are three types of FEDORA objects: a data object, a behavior definition object (BDef), and a behavior mechanism object (BMech). A data object contains system and descriptive information about one or more resource files. A BDef object contains descriptive information about a set of methods that define a particular behavior. A BMech object points to executable files that can carry out a given set of methods. These three objects are bound together in varying combinations: BDef and BMech objects are always paired together by a Disseminator in one or more data objects. They can be mixed and matched as needed, although a pair must have the same list of methods (i.e., they must be related to the same behaviors). A data object can have multiple disseminators and a BDef or BMech object can be used by other data objects.

To use a FEDORA resource, the user’s browser contacts a Java servlet on an appropriate server and requests a certain behavior (e.g., “display a list of photos that show the Salisbury Cathedral cloister”). The Java servlet, in turn, contacts FEDORA and learns the locations of the appropriate method for the desired behavior and of the resource file. It sends the browser a URL that names a service application that can apply the method to the resource file. The browser carries out the HTTP request and gets back the desired information.
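As a purely illustrative example, such a behavior request takes the form of a servlet URL carrying dissemination parameters. The parameter pattern below is the one shown for the repository’s URN disseminator in section 5.3.1; the host and servlet names are placeholders, and <project> stands for a project identifier:

http://<repository-host>/servlet/<dissemination-servlet>?action=dissem&sigName=web_default&methName=get_as_page&doid=1007.lib.dl.test/<project>-html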

 

4.1. Basic data object

Figure 2         A generic FEDORA data object

A FEDORA data object has four basic parts: a persistent ID (PID), one or more disseminators, metadata, and a basis (Figure 2). The PID is a unique alphanumeric identifier assigned to each FEDORA object. Normally it contains a date-time stamp. For Rossetti, we decided to use entity names as PIDs to maintain internal consistency. The disseminator, which connects a BDef and BMech object, creates data structures that pair behaviors such as get_text() with methods that implement the behaviors. A data object can have as many disseminators as it likes but it must have at least one.

The metadata section contains information describing the data object and its data streams: provenance, technical specifications, etc. Currently there is a minimum set of required system metadata for every FEDORA data object, and other metadata (technical, administrative, descriptive) can be added to the Basis as desired. The test cases discussed below do not yet include any metadata, but we will provide it in the future. In general, we expect that project creators and collecting libraries will each provide different parts of the required metadata.

The basis is made up of one or more data streams, which are pointers to resources such as text or audio files. The resources can be inside or outside of the FEDORA repository. In the SDS test cases, all resources were in the repository.


 

4.2. BDefs and BMechs

Figure 3       A behavior object

A disseminator is made up of one or more pairs of a BDef object and a BMech object. The BDef object defines a set of behaviors or methods and the BMech object defines a set of executables that can execute the methods. A BDef object can be paired to more than one BMech (and vice versa) but each pair must have an identical list of methods.

BDef and BMech objects are located in the FEDORA repository. They each have their own PID, metadata, and data streams, as shown in Figure 3. PIDs are assigned by the object’s creator when the object is created, just as with data objects. The metadata consists of system metadata and either method definition metadata (in a BDef object) or method implementation metadata (in a BMech object). Method definition metadata describes a set of methods and its data streams are pointers to specification files that describe the methods. The specifications can be simple documentation about the methods (e.g., ascii files or MS Word documents describing the possible behaviors) or they can be more programmatic in nature (e.g., a Java interface that defines how a developer can implement these behaviors in a Java environment). Method implementation metadata describes how to implement the methods. The data streams can point to files inside or outside of the repository.


 

4.3. Application services

The user’s browser will need one or more application services that can display the requested information in the browser. For example, an XML file must be translated into HTML before the browser can display it. Application services can be in the repository or outside, as shown above in Figure 1.

 

4.4. Testing FEDORA

We have tested FEDORA with two IATH projects: Marion Roberts’ The Salisbury Project (original version at http://www.iath.virginia.edu/salisbury/) and Jerome McGann’s The Complete Writings and Pictures of Dante Gabriel Rossetti: A Hypermedia Research Archive (original version at http://www.rossettiarchive.org/). In each case, we decided not to collect certain peripheral parts of the site in the first instance, so that we could concentrate on issues raised by the project’s core data. The processes we used are described in detail in sections 9 and 10. We are working towards collecting a third project, John Dobbins’s The Pompeii Forum Project (http://pompeii.virginia.edu/). More information on our progress with Pompeii is in section 11.

 

4.5. Current FEDORA efforts

Further research and development efforts for FEDORA are being carried out in a Mellon-funded joint project by the DLRDG and Cornell University. The grant proposal, related publications, and presentation slides can be found on-line at <http://fedora.comm.nsdlib.org/>.

 

5. WebCollector

We built the WebCollector to help execute the most highly automated, least expensive method for collecting The Salisbury Project and The Complete Writings and Pictures of Dante Gabriel Rossetti into FEDORA. For an example of the result of running WebCollector against the HTML portion of such a project, see the materials (exclusive of the image archive) available at http://dl.lib.virginia.edu/data/project/salisbury/ (for an example of a different strategy of collecting the more complex materials in the image archive, see below, section 9).

 

5.1. About

WebCollector is a web-based tool for copying HTML web sites for collection and preservation purposes. It is intended to help libraries collect and preserve on-line projects, research projects, dissertations, and theses. Collected pages can be moved to a protected environment such as a central digital repository or archive, or transformed to other formats as suits the project or the collector. The best candidates are sites with mostly static HTML pages containing text and images.

The WebCollector won’t collect server-side scripts, databases, or user-created JavaScript functions. It also won’t fix bad HTML, so some dry runs in the test modes are required to be sure that the user doesn’t get unexpected results. It is run from the target machine (i.e., it can’t copy to a remote machine) and must be able to write to the specified local directory.

 

5.2. Starting window

A sample blank WebCollector page is shown below in Figure 4. A project title is required, since the title will be used in error, log, and object information filenames.

 

5.3. Modes

There are five different modes, three for testing and two for creating output. Users can select one from the pull-down list in the upper right corner.

  1. Test Search Files/No copy: The WebCollector tries to parse each file in the specified URL and follow embedded links. It returns a list of how many pages it was able to parse and how many resources it found (web pages, images, etc). Nothing is copied.
  2. Show All Excluded URLs: Same as Test Search Files mode, but the WebCollector also returns a list of URLs that it found but did not parse because of the Bounding URL field. Excluded URLs will not be copied when running in Copy file mode unless the user specifies them in the Bounding URL field.
  3. Display Start URL contents: Same as Test Search Files mode, but it returns the projected HTML source code of the top page in the Search Start URL. This lets users verify that the copied output will be acceptable. Users can check the projected HTML of specific pages by entering a full path in the Search Start URL field.

Figure 4          Starting page for WebCollector

  4. Create objects/No copy: Creates an output file that can be used in generating FEDORA objects to hold copied HTML files in a FEDORA repository. The WebCollector does not actually make copies, however. This mode is useful if a project has already been collected and copied to another server but a user needs to alter the FEDORA object parameters. The file will be called <project title>.obj.
  5. Copy files & Create Objects: Parses and copies the specified URLs and creates three output files in the specified local directory: the output file, <project title>.obj; an error/runtime log, <project title>.err; and a list of all links the WebCollector encountered, <project title>.log. Users can use the output file to create FEDORA objects for the copied HTML files.

 

5.3.1. Other Fields

There are five more fields in the upper half of the WebCollector window shown in Figure 4, above. All fields must be filled in unless the user is running in any of the test modes. In that case, the user only needs to fill in the Search Start URL and Bounding URLs fields.

·        Search Start URL

The starting point for the WebCollector to begin parsing files. This path can be a domain name, a directory, or a specific file. This is a required field in all modes.

·        Bounding URLs

One or more paths that set the boundaries for WebCollector’s search: anything outside those boundaries will not be copied. Boundary URLs can be domain names, directories, or files. The user may need to run a few tests in Show All Excluded URLs mode to check the settings. Precise boundaries are very important, since the WebCollector will conscientiously follow links all over the internet and may collect far more than the user intended. This is a required field in all modes.

·        Change URLs to

This is a text string that will replace any URLs that point to outside references from inside the copied pages. In the current implementation, the string should be a call to the repository URN disseminator, specifying a servlet and its required parameters. For example:

action=dissem&sigName=web_default&methName=get_as_page&doid=1007.lib.dl.test/<project>-html

The resource name will automatically be added at the end. This is a required field unless it is running in a test mode.

·        Datastream URL

The top level URL. This indicates the directory that will contain the copied files after they have been processed. It tells the FEDORA repository where it can find the copied objects. This is a required field unless it is running in a test mode.

·        Copy files to Dir

A local directory for storing processed files. This directory must correspond to the Datastream URL above and must be accessible via a web server. Note that the WebCollector cannot copy onto a remote machine. This is a required field unless it is running in a test mode.

Figure 5, below, shows a sample run with all of the fields filled in.


 

5.4. Sample

Here is the result of a test run.

Figure 5          Results of a sample run

 

6. Granby

The Granby Suite is a software project at IATH, in partnership with the Brown University Scholarly Technology Group (STG) and the University of Maryland’s Maryland Institute for Technology in the Humanities (MITH). Granby is a free XML publishing tool for scholars and academic research projects that will significantly enhance access to, and preservation of, digital resources. All three institutions have produced highly regarded digital scholarly projects, using hardware- and software-independent SGML, and published them with expensive and monolithic SGML software. Until very recently, this approach was the best option for publishing such projects. IATH, STG, and MITH all strongly support SGML and XML for the creation and long-term storage of scholarly information, and all recognize the need for more modular, flexible, and extensible publishing software.

Now that the XSL standard has emerged, nonproprietary strategies are available for rendering XML data. With the emergence of XPath and XQuery, searching will soon be disentangled from proprietary software as well, and expensive, monolithic software packages will no longer be needed for electronic publishing or for search and retrieval. Without an architecture to orchestrate these functions and control the interaction of the software modules that perform them, however, we won’t really have escaped the problem of creating data that is, at some level, hard-wired to the software used to deliver it. Granby is developing an architecture that allows for a more flexible and extensible approach. Its design goals include a high degree of modularity, strict separation between program logic and interface, user-level and administrative extensibility, and use of free (open-source) tools. It will let developers connect text repositories to textual analysis software and powerful visualization tools for viewing statistical results, and it will support Unicode, which provides a standard method for encoding non-Roman characters. Its modularity will make it easier to develop, distribute, and update the individual components of a publishing system, and will make Granby suitable for collaborative and distributed development.

Developing this kind of software is prohibitively expensive for individual research groups, and previous attempts have proved inadequate to the needs of larger collections. Brown, Maryland, and Virginia’s combined resources will allow Granby to work on a more practical and useful scale. Virginia’s part of this effort has been supported in part by funds from the SDS grant. More information about the Granby Suite is available on the Web, at http://busa.village.virginia.edu/granbydocs/granby.html.

 

7. GDMS

The General Descriptive Modeling Scheme (GDMS) is a DLRDG project to build formal information structures that can be used to model real-world phenomena and more abstract concepts. These models, in turn, can be used to create contexts for digital resources and collections. The underlying data structure is governed by an XML DTD, which allows the model to be hierarchical or flat and provides ways to cross-reference data inside or between models. Note that the model is a formal structure for content only: instructions for rendering or delivering views of the content must be provided by another information structure, such as an XSLT stylesheet. Potential GDMS applications include descriptions of collections with a complex structure (such as a set of architectural images or a set of resources related to an archeological site), annotated bibliographies of digital resources, museum exhibitions, and descriptions of historic or artistic events.

 

7.1. GDMS in SDS

In the SDS project, we used GDMS to describe the architectural structure of Salisbury Cathedral, providing a hierarchical structure to organize the 1000+ images in the project (see section 9). In the past year we have been doing design work to start moving data for The Pompeii Forum Project (PFP) to a similar structure (see section 11). PFP will also experiment with using a GDMS structure to organize the data for the project as a whole. The data structure describing the ruins of the Forum at Pompeii as a hierarchical arrangement of spaces and walls can be a branch of a larger tree structure that represents the project, with other parallel branches containing field notes arranged by types of text and date, descriptions of artifacts found in the ruins, and so forth. The system makes cross-references among different branches of the tree possible.

 

7.2. About the DTD

In general, a digital scholarly project is a collection of digital resources. To bring such a project into a digital library, the project/collection must be treated as a network of connected digital objects. Ideally, it starts with a single object that represents the project as a whole. That initial object becomes the “front door” for the project, from which point the whole project can be assembled via links to other objects. A GDMS model can be used to build a “front door” object for a project. As described above, a branch of a GDMS model can be both deep and broad, acting as the primary organizing context for a large set of resources. Or, a branch of the main tree that represents a project could be a single node deep, providing a connection to another type of data structure as well as metadata that establishes the context for other data within the project. For example, a branch of a GDMS model of an archeological project may be merely a connection to an SQL database that contains all of the measurements of the ruins at a site, or a connection to an EAD (Encoded Archival Description, an SGML DTD) finding aid that describes the collection of papers that are associated with a dig.

The GDMS DTD (included in the Appendix, section 14.2) is designed to facilitate two functions: editing a hierarchical structure and editing the contents of the structure. The data model is essentially a tree graph, with a single XML tag used to define each node of the conceptual tree, beginning with a single DIV element. Each DIV can contain any number of other DIVs. All of the other contents of the XML instance, or model, are contained within a DIV (with the exception of some header data for the model as a whole and a resource catalog). This allows the editing process to be developed around one set of functions that lets users build, change, and navigate the model, and another set of functions that lets users edit the contents of any particular DIV in the model.

In general, each DIV contains one DIV description (though this is optional), any number of resources, and any number of other DIVs. Each DIV can have both a type and a label attribute that can be used to classify and name it. The DIV description is intended to further develop the semantics established by the attributes on the DIV. It includes a set of descriptive metadata tags, each of which is optional and repeatable, that can be used for short texts and/or data fields that describe the DIV. Each resource includes tags that describe and classify a digital resource with respect to its context in the model and tags that describe how to retrieve and use it. There are two types of resources: one has a pointer to a digital resource and the other includes in-line text. A schematic example follows.
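As a purely schematic sketch of this structure (DIV and its type and label attributes are as described above, but the other element and attribute names here are placeholders rather than the DTD’s actual tag names; the real DTD is reproduced in section 14.2):

<div type="building" label="Salisbury Cathedral">
  <divdescription>
    <title>Salisbury Cathedral</title>
  </divdescription>
  <div type="space" label="Cloister">
    <resource type="pointer" href="cloister-001.jpg"/>
    <resource type="text">A short in-line note about the cloister.</resource>
  </div>
</div>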

 

7.3. Future plans

The project aims to create a tool set that includes the DTD itself (in two versions: the full DTD and the DTD for the current subset that the tool development effort is addressing) and a database toolkit that allows XML instances to be created, edited, searched, and rendered for display using XSLT stylesheets. The tool set is built around a general XML editor that allows a user to create and edit a single GDMS instance. A variety of other modules are planned that provide a way to more efficiently process and include different kinds of digital resources and to simplify manipulating a model for particular uses. All software developed for the project will be freely available when it is ready.

 

8. GDMS toolkit

We are currently developing a GDMS toolkit to help create or adapt XML documents to conform to the GDMS DTD. The toolkit is a Java application, executed from the command line with a GDMS schema as a required parameter.
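For example, assuming the toolkit is packaged as a runnable jar (the jar and schema filenames below are placeholders, not the toolkit’s actual names), an invocation might look like this:

java -jar gdms-toolkit.jar gdms.dtd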

Figure 6          Starting page of the GDMS Toolkit

The toolkit user sees a graphical user interface with two panes (Figure 6). The left pane is a tree representation of the GDMS structure that the user is building, and the right pane is a window for entering and editing content and attribute values. The tree in the left pane displays nodes that contain other nodes (there are no leaf or text nodes). The user selects an item on the tree and the corresponding raw XML text appears in the right pane. The user can then delete items from the tree, add items to it, or edit the selected item’s content and attributes (see Figure 7, below).

 

8.1. Menus

The toolkit has several pull-down menus for major functions. Currently there are five:

·        Project: Create new projects, open, save, and close existing projects

·        Modify: Edit, delete, cut and paste, and reorder nodes

·        Add: Create new child nodes

·        Resources: Call other modules and link external resources files to items

·        Help: On-line assistance

Other menus may be added as necessary.

Figure 7          Editing a resource pointer node

 

8.2. Working with nodes

Users can edit, delete, cut, or move a node by selecting it in the left panel and using the appropriate pull-down menu. A template following the rules of the current schema will pop up with options for adding or editing nodes and, if applicable, pull-down menus with possible attribute values. Empty fields will be ignored when the changes are saved.

Sets of existing sibling nodes of the same type can be reordered with the Move Up and Move Down options in the Modify menu. A node can be moved to a new parent by cutting and pasting.

 

8.3. Validation

Normally, validation will be turned off in the toolkit, since the template’s rules should ensure that it generates valid XML. However, there is an option for validating an existing file when it is first opened.

 

8.4. Adding resource files

The toolkit has a module that lets users add other resource files (such as images and text files, although currently only image files can be added) to a project. These files must be accessible over the Web via URLs. The resources are collected and added to a catalog for the document.

 

8.5. Future features

The toolkit is still under development, but in the future we will add other features.

·        Extend the resource select module to handle other media types and to access objects already in the repository.

·        At the moment, users cannot add mixed content (text and tagged elements) to #PCDATA fields. We are hoping to find a way to do this.

·        Implement GDMS to create and handle large multi-file, multi-document structures.

·        Add on-line help and documentation.

 

 

9. The Salisbury Project

 

9.1. About Salisbury

The Salisbury Project (http://www.iath.virginia.edu/salisbury/ as originally published) is a collection of visual and textual information about Salisbury Cathedral and the surrounding area. It consists of hundreds of images with descriptive metadata encoded as an EAD document, and HTML-based textual material. The site also includes a teacher’s guide, maps, bibliographies, essays, and 3D models but we decided to focus our efforts on collecting the image archive. This section will provide a detailed explanation of work that was actually completed in the first year of this project, but not fully explained in the first annual report. 

The core of the Salisbury project is an image collection. The images are photos of the cathedral and are organized according to the cathedral’s ground plan. The collection’s home page offers several ways of navigating through the cathedral (Figure 8).

Figure 8          Salisbury image archive

The top left frame shows an expandable list of images arranged by physical location. The lower left frame is the ground plan of the cathedral and indicates which part of the cathedral a selected photo shows. The lower right frame shows thumbnails of photos in a given group. The upper right frame can be adjusted to show all or part of a full-sized photo.

The original image archive contains approximately 1000 images. There are three images for every photo: a thumbnail copy, a full-size copy, and a copy of the ground plan marked to show the photo’s location. All of this is organized as an SGML document using the EAD DTD, a standard for describing archival collections. In the original publication, access to the SGML data is provided through the DynaWeb electronic publishing package, a commercial web server that transforms SGML into HTML on demand.

 

9.2. Converting EAD/SGML to GDMS/XML

The DynaWeb version of the image archive uses an EAD SGML document plus a customized stylesheet that produces HTML on demand. We decided to translate the EAD SGML into a GDMS XML document, since it seemed a good opportunity to try out GDMS on an existing project. The image archive contains photos of the cathedral as well as sets of ground plans highlighting which parts of the cathedral are shown in the photos. GDMS  provides a context for describing the contents of and relations between digital objects, which we felt would be useful for connecting each photo to its corresponding ground-plan image. (See section 7 for more about GDMS.)

We first parsed and normalized the SGML EAD document to be sure that it was valid and would translate cleanly. This may seem an obvious step, but it is essential. A flawed document can cause many processing problems later on. We then needed to be sure that all of the SGML would function in XML. In this case, EAD’s container structure closely matches GDMS’s structure so we were confident that the documents would translate well.

We translated the SGML to XML by hand, using the Emacs editor and its PSGML major mode, with the regular expression search-and-replace function. This is probably not the ideal approach, but we felt that in this case it would be the fastest and most practical. We might have been better off using SX to produce an XML version of the EAD document and then using an XSLT stylesheet to generate GDMS XML from that. This approach would have been somewhat more time-consuming, but it would have produced a reusable solution (the stylesheet) that could be applied to other EAD documents.
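To sketch that alternative route (the filenames and stylesheet name below are illustrative, not files from the project): SX, part of James Clark’s SP distribution, reads an SGML document and writes an XML equivalent to standard output, after which an XSLT processor such as Saxon could apply a conversion stylesheet:

sx salisbury-ead.sgml > salisbury-ead.xml
java com.icl.saxon.StyleSheet salisbury-ead.xml ead2gdms.xsl > salisbury-gdms.xml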

 

9.3. Stylesheets

We wanted to emulate a complex proprietary electronic publishing package with a system that was completely open-source, easy to migrate, and adaptable to similar tasks, and since we had already decided to use XML, the XSLT transformation language was a natural choice. It would allow us to recreate the DynaWeb dynamic document interface. XSLT is part of the XSL stylesheet language and has proven to be a flexible and powerful tool for the transformation of XML documents, especially to HTML. The resulting stylesheets are included in the Appendix to this report.

A dynamic interface allows the presentation of text and available navigation options to be arranged according to the user’s needs. For example, the default view might show a table of contents of a work’s major sections (Figure 9A) and some prefatory material. If the user wants to see more detail in a particular section of the table of contents, a new table showing an expanded view of that section is dynamically created (Figure 9B). The associated content in the rest of the browser window is also changed to match the chosen section.

Figure 9          Salisbury table of contents

(A)

(B)

To emulate this deceptively simple behavior with a text transformation language we rely on four basic concepts: URL parameter passing, table of contents construction, navigation aids, and content synchronization.

In order for the user to be able to communicate with the system, the user’s selections and requirements have to be passed back to the application. This is done via URL parameters. For instance, a link in the table of contents contains information about the ID or node designation of the particular document section it describes.

URL: http://www.foo.bar/foobar?div.id=div1

The name/value pair div.id=div1 is passed back to the processing servlet, which makes it available to the XSLT stylesheet as a parameter that can be referenced with a construct such as:

<xsl:value-of select="$div.id"/>
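To sketch how such a parameter might be declared and then used to pull out the requested division (the lines below are only illustrative; the production stylesheet, dynaxml.xsl, is reproduced in section 14.3.1):

<xsl:param name="div.id" select="'div1'"/>
<xsl:apply-templates select="//div1[@id=$div.id]"/>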

 

9.3.1. Table of Contents Construction

An essential requirement of any dynamic system is the simultaneous display of different views of a particular document. In XSLT this is accomplished via template modes. In this way, a document can be reprocessed many times to produce the multiple views required to create a dynamic interface. The sample below shows a snippet of a template.

<xsl:apply-templates select="div1" mode="toc"/>

 

<xsl:template match="div1" mode="toc">
  <!-- Do something... -->
</xsl:template>

This can then be built on to produce a nested table of contents. Modes allow you to process all the division heads of a document multiple times, once to create the TOC and again when they have to be displayed in the context of the content. First, process the level one divisions. If a division has a head element, the element can be displayed in the HTML of your choice. If a level one division contains one or more level two divisions with head elements, the stylesheet should be instructed to call the appropriate template:

<xsl:apply-templates select="div1" mode="toc"/>

<xsl:template match="div1" mode="toc">
  <xsl:if test="head">
    <li><xsl:value-of select="head"/></li>
    <xsl:if test="div2/head">
      <xsl:apply-templates select="div2" mode="toc"/>
    </xsl:if>
  </xsl:if>
</xsl:template>

This continues until you’ve processed all the divisions contained in this particular section, then on to the next.

 

9.3.2. Navigation Aids

Another common feature of dynamic interfaces is a set of navigation aids that help the user move easily through the document. The Salisbury image archive contains links that move you sequentially through the sections of the collection. This type of control structure can be contained in a template that generates these kinds of links. The template checks whether or not the particular section of the document being displayed has a preceding sibling. If it does, a linked arrow is displayed so that users can view that section. The link URL is constructed by concatenating a previously defined base URL with a name/value pair that passes the preceding section’s ID back to the server so that it can be used in rendering the next view of the document.

<xsl:choose>
  <xsl:when test="preceding::div1">
    <a>
      <xsl:attribute name="href">
        <xsl:value-of select="$base.url"/>&amp;chunk.id=<xsl:value-of select="preceding::div1/@id"/>
      </xsl:attribute>
      <img src="b.prev.icon.gif"/>
    </a>
  </xsl:when>
  <xsl:otherwise>
    <img src="d.prev.icon.gif"/>
  </xsl:otherwise>
</xsl:choose>

 

9.3.3. Content Synchronization

When you are displaying multiple views of a document at the same time, you usually want to synchronize those views. This is done by concurrently updating the parameters that control the display of each view. So when the stylesheet has to determine which plan image to display, the plan ID is matched to the correct figure and its entity attribute is used to construct the filename for the plan.

<img>
  <xsl:attribute name="src">
    <xsl:value-of select="//figure[@id=$plan.id]/@entity"/>.plan.gif
  </xsl:attribute>
</img>

With these four basic tools it is possible to construct something as complex as the DynaWeb publishing system or almost any other document interface whose focus is the structural navigation of a document.

 

9.4. Saxon

We needed a way to make our dynamic stylesheets available as a distributed web service. This could be readily accomplished by packaging Saxon, our transformation engine of choice, as a Java servlet. Saxon is a collection of tools for processing XML documents. It has an XSLT processor, which implements the Version 1.0 XSLT and XPath Recommendations from the World Wide Web Consortium. The Saxon software package includes a sample Java servlet that enables you to run the XSLT processor as a web application. We edited this servlet slightly to allow the XML documents and XSLT stylesheets to reside on separate servers, by making URLs valid as file pointers. Input to the Saxon servlet, including the source XML file and the XSLT stylesheet, is provided through URL parameters that are then passed to the servlet. This dynamic method of passing parameters enables the source XML and XSLT stylesheet files to reside on any web-accessible host regardless of where the Saxon service is located. (See http://saxon.sourceforge.net/saxon6.4.4/index.html for more information about Saxon.)
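In general, a request to this servlet passes the source document and the stylesheet as URL parameters; the pattern below follows the actual Salisbury URL given in section 9.7 (the bracketed values are placeholders):

http://dl.lib.virginia.edu/servlet/SaxonServlet?source=<URL of XML document>&style=<URL of XSLT stylesheet>&clear-stylesheet-cache=yes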

 

9.5. Data objects

Having built stylesheets and a service application, we needed to distribute and organize these many pieces of the image archive into FEDORA data objects. A FEDORA data object can have multiple data streams, so in theory we could have used a single data object pointing to all Salisbury resources. This would have been impractical, however, as well as difficult to maintain. The logical approach was to divide the image archive up into groups of the two versions of each image (thumbnail and medium) and the picture of the cathedral ground plan marked to show the image’s location (see the lower left corner of Figure 8 for an example). At this point the XML and HTML documents could have been handled by a single data object. However, they use different disseminators and we decided not to complicate the test any more than necessary. We used two different data objects for XML and HTML and created one object for each file, as shown in Figure 10.

Figure 10        Salisbury data objects
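As a rough sketch of one such image object (the repository actually stores this information in database tables, and the PID shown here is a made-up example following the identifier pattern used in section 5.3.1; the disseminator methods are described in the next section):

Data object  PID: 1007.lib.dl.test/salisbury-image-0042   (hypothetical)
  Disseminator: image (pairs an image BDef and BMech; methods get_thumb(), get_med())
  Basis:
    data stream 1 -> URL of the thumbnail image
    data stream 2 -> URL of the medium-size image
    data stream 3 -> URL of the ground plan marked with the photo's location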

 

9.6. Disseminators

We used three disseminators. The image disseminator, shown in Figure 11, calls two methods, get_thumb() and get_med(). The get_thumb() method calls up the thumbnail size version of an image. The get_med() method is valid for both the medium images and the cathedral ground plan images.

Figure 11        Salisbury image disseminator

The XML disseminator, shown in Figure 12, calls one method, get_DynaWeb_view(). This method calls the XSLT stylesheet, which converts XML to HTML on the fly.

Figure 12        Salisbury XML disseminator

The HTML disseminator, shown in Figure 13, calls two methods, get_as_page() and get_in_context(), both of which will return requested HTML pages.

Figure 13        Salisbury HTML disseminator

See the Salisbury content models, Figure 21 in the Appendix, for more information.

 

9.7. Moving into FEDORA

The prototype FEDORA repository provides a set of batch loading programs written in Java for creating batches of similar types of objects (e.g., image objects, text objects, HTML objects). The input to these batch programs is a comma-delimited data file containing the information necessary to create a FEDORA object (e.g., PID, disseminator names, metadata, URLs of data stream locations, etc.). The batch loading programs translate the information in the data file into SQL requests, using JDBC to update the repository database, which is currently a MySQL database. A Perl script was used to gather the necessary information from the image archive and the object content model being used and then build the input data file. Once the data file was built, the batch object creation program was run in order to generate a FEDORA object for each of the images in the archive. The result of this migration can be seen at:

http://dl.lib.virginia.edu/servlet/SaxonServlet?source=http://dl.lib.virginia.edu/data/xmltext/salisbury/salisbury.xml&style=http://dl.lib.virginia.edu/bin/xsl/salisbury/dynaxml.xsl&clear-stylesheet-cache=yes

The process for importing the HTML pages into FEDORA is slightly different from the process used for importing image objects. The HTML files were collected with the WebCollector program written here at UVA (see section 5). As the WebCollector collected the web pages, it generated a data file that was used to generate the HTML objects. The batch creation program was then run against that data file to create a FEDORA object for each HTML page.

 

10. The Rossetti Project

 

10.1. About Rossetti

The Complete Writings and Pictures of Dante Gabriel Rossetti: A Hypermedia Research Archive (as originally published at http://www.rossettiarchive.org/ and as collected in FEDORA at http://dl.lib.virginia.edu/data/project/rossetti-dyna/archive.html) is a very large edition of the works of the Pre-Raphaelite poet and painter Dante Gabriel Rossetti. A work in progress, it aims to provide a digital image of every textual and pictorial document relevant to Rossetti scholarship. The bulk of the current archive consists of SGML transcriptions of Rossetti’s textual materials as well as critical commentary on works and individual items.  By including both critical commentary and transcriptions with digital images, the Archive blends critical editing with facsimile editing.

Like The Salisbury Project, it originally used DynaWeb to transform SGML into HTML for Web distribution, and as with Salisbury, we decided to move its textual content to an XML environment that has non-proprietary, standards-based tools for transformation and rendering. The DynaWeb version is made up of about 5,000 SGML files transformed by DynaWeb stylesheets for presentation in HTML, about 5,000 image files, and some natively HTML pages. There are roughly 40,000 cross-references within the SGML files, and preserving those relationships (in some real sense, the most important editorial work in this edition) was a critical part of this exercise.

The Archive went through several preliminary steps before the FEDORA import. Before the move, the Archive did not have full functionality: our DTDs and stylesheets had not been overhauled since the first time we published content to the web, and document searches through DynaWeb queries failed to perform correctly because our complex entity names also had to function as workcodes. We wanted the FEDORA project to catalog the Archive in its best possible form, and we felt it was a good time to set our house in order. So we streamlined our DTD structure: we had three distinct DTDs (one for documents, one for pictures, and one for commentaries) that we merged into one master DTD. We also teased apart the entity names of our documents so that workcodes succeeded in attaching documents to works. While not absolutely necessary for porting the Archive into the FEDORA system, these steps made the move much easier. Both the DTD and the workcode changes could have waited until after FEDORA, but the FEDORA project gave us an opportunity to revisit and improve our design and functionality.

 

10.2. Editorial architecture

Rossetti was an accomplished visual artist and writer, and his artworks comment upon one another and upon works by other artists. The project editor, Jerome McGann, needed an editorial architecture that could represent complex relations among multiple documents and images. The creation of that architecture, for the most part, preceded the SDS project and involved much trial and error. For a long time, the burden of identification and inter-relation fell on a single identifier: each object in the archive, whether pictorial or textual, was given a unique ID. This ID appeared as an attribute on the Rossetti Archive Master (RAM) element, which is the root element of our SGML, and on DIV elements, which mark divisions in the document hierarchy (such as single poems within a collection).

But the ID, it turned out, could not be designed to imply an object’s relation to other objects. The Rossetti group had hoped that a search engine would be able to deduce these relations from the semantics of a given ID. For example, while a research assistant understood that the ID “s205a.mansell” meant “the Mansell reproduction of the ‘a’ version of item 205 in the Surtees catalog,” the DynaWeb search engine could not infer all related documents. Ultimately, the semantics of the ID attribute were too complex and too compact to be unambiguous. The problem was handled in various ways in the original DynaWeb publication (pre-collocating related materials, for example) but it became an unavoidable impediment to moving the Archive into FEDORA, since critically important contexts and relationships would need to be unambiguously expressed in order to be preserved. The solution was to disassemble the content of the original ID attribute and assign different parts of that content to several separate and unambiguous attributes.

The RAM and DIV elements now include information about the object's type and its relations to other objects. For example, the RAM for object s205a.mansell, a photographic reproduction by Charles Mansell of the "a" version of the pictorial work s205, part of the double work 2-1867.s205 ("Lady Lilith"), might look like this:

<RAM ARCHIVETYPE="rap" TYPE="painting"
METATYPE="web.visual" ID="a.s205a.mansell"
IMAGE="a.s205a.mansell.tif"
WIDTH="747" HEIGHT="500"
WORKCODE="2-1867.s205" VERSION="a"
DBLWORK="2-1867.s205" PHOTDUP="mansell">

These attributes explicitly set out the object’s type, how it should be rendered, and how it is related to other objects.

·        ARCHIVETYPE: Describes the kind of SGML file the object holds, according to the schemes of the Rossetti Archive. A RAP file is a “Rossetti Archive Pictorial” work. There are also RAD (a document, i.e., a textual item), RAW (a work), and RAC (a commentary on genres) files.

·        TYPE: Defines the type of object under discussion, in this case a painting.

·        METATYPE: Describes the object's genre: a book, a serial publication, a poem, a prose piece, a translation, a manuscript, a visual (pictorial) work, a double work, or a textual or pictorial work not done by Rossetti. These type attributes enable the proper sorting of SGML files when the user interface is constructed.

Since the example is a visual work, the RAM element also includes information that describes the digital image for HTML rendering.

·        IMAGE: Identifies an image file holding a digital reproduction of the painting.

·        HEIGHT and WIDTH: Provide the image’s dimensions, so that the ImageSizer applet can perform on-demand scaling of the image.

The remaining attributes identify the object and define its relations to other objects in the archive.

·        ID: Identifies the particular object described by the SGML. This identifier includes most of the information that is then teased out in the following attributes.

·        WORKCODE: Identifies the work to which the particular object is related.[*] In the archive, the workcode attribute specifies which work an SGML file is associated with. All files or parts of files that pertain to the same work (such as “The Blessed Damozel”) use the same workcode, enabling the archive’s search engine to find all instances of a work.

·        VERSION: This attribute appears only when there is more than one version of a work. It is assumed that an exemplary version of each work exists; for paintings, for example, the exemplary version would be the first finished painting or, in some cases, the most complete painting or sketch. Rossetti often created writings and pictures that were meant to accompany one another, called double works, and in that case there is an exemplary version of the written text and of the painting. Thus there are two exemplary versions of "The Blessed Damozel," one for the poem and one for the painting. Non-exemplary versions typically include drafts, revisions, or manuscripts of a written work and sketches, studies, or reproductions of a pictorial work.

·        SUBSET: This attribute marks cases in which Rossetti produced several independent works that also combined to make a single larger work. For example, the pair of sonnets that form the written part of the double work "The Girlhood of Mary Virgin" were conceived and published as two separate pieces, but Rossetti considered them parts of a single poem.

·        DBLWORK (double work): This attribute appears for objects that relate to a double work. The dblwork attribute has the same value as the workcode attribute (see the sketch after this list). In future builds of the archive, this attribute will enable users to go from one instance of a double work to other parts of the work.

·        RLTDOBJECT (related object): This attribute identifies a close association between a Rossetti work and the work of another artist. For example, a rltdobject attribute appears for all of Rossetti’s translations from the Italian stil novisti poets. In the future this will allow a user to hyperlink from Rossetti’s translations to the Italian original.

·        PHOTDUP (photo duplication): If the object is a photographic reproduction of Rossetti’s artwork, photdup indicates who made the reproduction.
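For comparison, here is a minimal sketch of how these linking attributes might appear on the textual half of the same double work. The WORKCODE and DBLWORK values come from the "Lady Lilith" example above, and the ARCHIVETYPE value follows the scheme described under ARCHIVETYPE, but the TYPE value is hypothetical and the ID is left as a placeholder:

<RAM ARCHIVETYPE="rad" TYPE="poem"
ID="..." WORKCODE="2-1867.s205"
DBLWORK="2-1867.s205">

Because both halves carry the same WORKCODE and DBLWORK values, the archive's search engine can gather them under a single work, and future builds can link from one part of the double work to the others.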

 

10.3. Data objects

We needed to preserve very complex relationships between the individual pieces of the Archive when distributing and organizing them into FEDORA data objects. A FEDORA data object can have multiple data streams, so we had the option of using a single data object pointing to all of the 10,000+ Rossetti resources. That would have been extremely impractical to maintain, so we decided to create one data object per intellectual object and to organize them by type. This simplified the task of importing resources into the repository later on.

Figure 14       Rossetti data objects

As shown in Figure 14, there are three types of Rossetti data objects: one for images, one for documents, and one for HTML files.

 

10.4. Disseminators

We used three disseminators, one each for images, XML files (i.e., the SGML files converted to XML), and HTML files. Disseminators and data objects need to be carefully matched, to be sure that the correct method is called for a given data stream: a FEDORA data object's disseminators must be bound to an appropriate BDef and BMech pair or the object will not function.

The image disseminator returns the requested image in the appropriate size. It uses the rossetti_image BDef object and the rossetti_image1 BMech object, as shown in Figure 15, and it calls two methods: get_thumb(), which returns the thumbnail version of an image, and get_med(), which returns the larger version.

Figure 15        Rossetti image disseminator

The XML disseminator uses the rossetti_text BDef object and the rossetti_text1 BMech object, shown in Figure 16. It calls one method, get_text(), which converts a given portion of XML into well-formed HTML.

Figure 16        Rossetti XML disseminator

The HTML disseminator uses the web_default BDef object and the web_default1 BMech object (Figure 17). It calls two methods, get_as_page() and get_in_context(), both of which call up requested HTML pages.

Figure 17        Rossetti HTML disseminator

See the Rossetti content models, Figure 22 in the Appendix, for more information.

 

10.5. SGML to XML

The DynaWeb version of Rossetti uses SGML files tied together with a combination of HTML pages and DynaWeb stylesheets. Before importing these files into FEDORA, we needed to transform the SGML into XML. We decided to use SX, a free, open-source command-line tool that converts SGML to XML: it parses and validates an SGML document and then writes an equivalent XML document, warning about SGML constructs that have no XML equivalent. James Clark is the tool's author; see http://www.jclark.com for more information about SX.

 

10.6. Stylesheets

As with Salisbury, we needed to create a set of (standards-based) stylesheets that could emulate the functionality of the (proprietary) DynaWeb stylesheets: we again decided to use XSL and XSLT (see section 9.3 for more about writing XSLT stylesheets). The new stylesheets were written by the Rossetti project manager, William Hughes, using a set of templates generated by Kirk Hastings (included in the Appendix, section 14.3).

In general, when composing these stylesheets, any pointers to external resources need to be changed into dissemination requests. In Rossetti, external references usually take the form of entity names (which are the same as the workcodes described above in section 10.2). In SGML, entity references work like symbolic links, giving shorthand names for a path. The entity declarations appear in the DOCTYPE declaration at the top of the document, like this:

<!ENTITY name SYSTEM "path">

The name is a text string and is stored in a catalog. The path must start with http:// or ftp:// if it is not a local path (the parser will assume it is a local path unless told otherwise). The entity name will later be used in the document body, like this:

<tag xyz="name">

This will refer back to the ENTITY declaration.
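Taken together, and using the entity name from the example in section 10.2 as an illustration (the SYSTEM path here is invented, and <tag xyz=...> is the same generic placeholder used above), the pattern looks like this:

<!DOCTYPE RAM [
<!ENTITY s205a.mansell SYSTEM "http://www.rossettiarchive.org/img/s205a.mansell.jpg">
]>
...
<tag xyz="s205a.mansell">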

We could have kept this entity-based scheme, but resolving entities slows down performance. In theory the entity references could have been replaced with a search-and-replace script, but in practice that would have been complex and error-prone. The best solution was to pull the ENTITY declarations out of the documents and transfer their work to the stylesheets, which avoided undesirable intervention in the body of the document itself. In Rossetti, the entity name matches a file name, so the stylesheets can simply take the entity name and fit it into the FEDORA PID:

1007.lib.dl.test/rossetti/name.[jpg|xml]

The suffix can be determined from other tag attributes. The PID can later be used in a dissemination request when the resource is requested from FEDORA.  
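A minimal XSLT sketch of that substitution is shown below. The element name ref and the attribute name entity are hypothetical stand-ins for whatever element carries the entity name in the Rossetti markup; only the PID pattern is taken from above, and a real template would also choose between the .jpg and .xml suffixes based on other attributes:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="ref">
    <!-- Fit the entity name into the FEDORA PID pattern described above. -->
    <xsl:variable name="pid"
        select="concat('1007.lib.dl.test/rossetti/', @entity, '.xml')"/>
    <!-- Emit an HTML link; in the archive the PID is wrapped in a dissemination
         request rather than used as a bare href. -->
    <a href="{$pid}">
      <xsl:value-of select="@entity"/>
    </a>
  </xsl:template>

</xsl:stylesheet>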

 

10.7. Moving into FEDORA

At this point, we had XML files and stylesheets that could be imported into the FEDORA repository. The HTML files were collected with the WebCollector tool (see section 5), which automatically changed any calls to DynaWeb into disseminator requests so that the pages could then be converted into FEDORA objects. The disseminators had already been prepared, as explained above, and since we decided not to include metadata we did not need to gather any more information about the resources. We did not need to generate PIDs, since we were using existing entity names as PIDs (as explained above). We used Perl scripts to convert resources into FEDORA objects; this was fairly simple, since each data object has one XML or HTML file, or a pair of image files, as its data stream. We then inserted disseminators into all of the FEDORA objects.

One slight hiccup arose over pairing thumbnail images with their larger versions. In FEDORA, images are normally paired into sets of thumbnail image and medium image, but there weren’t thumbnails for every larger image. We decided to use a temporary “not available” image in place of missing thumbnails. If the script couldn’t find a thumbnail to pair with a medium image, it automatically paired the image with the “not available” thumbnail. 

Note that importing into the repository will probably be more complex in projects that have several disseminators that are not used by all object types, use multiple data streams in a single data object, or have complex sets of metadata.

 

11. The Pompeii Forum Project

The Pompeii Forum Project (PFP) is a collaborative research effort in archaeology and the history of urban design (http://pompeii.virginia.edu/, as originally published). It aims to provide the first systematic documentation of the architecture and decoration of the Forum, interpret evidence as it pertains to Pompeii's urban history, and make wider contributions to both the history of urbanism and contemporary problems of urban design.

PFP consists of a diverse collection of images and photographs, detailed image descriptions, architectural information, CAD models (some built by hand and some by photogrammetry), maps, survey work reports, essays, and a teacher's guide. As with Salisbury and Rossetti, we do not plan to collect the entire site, only specific parts. The most interesting and complex part is the collection of photographs and notes from research work on-site at the Pompeii Forum. The photographs were arranged in a logical sequence and transferred to digital form for display on-line. The site organizes these images in two ways. One is a set of tables with textual captions describing the images' contents (Figure 18).

Figure 18        PFP image list

The other is a series of image maps of various parts of the Forum. Camera icons show where the photographs were taken and each photo’s angle and identifying number (Figure 19). Users can click on a camera icon to see the associated photograph. The PFP is still in progress, as new photographs are added and more of the Forum is examined.

Figure 19        A detail of the Macellum (the market place) image map.

We are taking the opportunity to try something different: we emulated the Salisbury and Rossetti sites when we collected them, but we are going to reconstruct the PFP information structures. We will collect the primary resources (images, notes, image descriptions, etc.) and put them into a wholly new underlying structure in FEDORA. The intellectual content will not change but it will be organized and accessed differently. We hope to mechanize this as much as possible, using the GDMS DTD and toolkit (see sections 7 and 8) to infer data structure and harvest metadata so as to identify and track objects once they have moved into FEDORA. The DTD and toolkit will also create a workspace for the Pompeii developers that will further automate the process of moving data into the FEDORA repository.

 

11.1. Information structure

The PFP information structure is still under development, but after much discussion among IATH, DLRDG, and PFP we have settled on an overall structure (shown in Figure 20) with two major components: catalogs of digital resources and the intellectual analyses of those resources. The left side holds descriptive information that constitutes the resource analysis, and the right side holds the digital resources that are the basis of that analysis. The reason for the split is to separate metadata that describes a resource (such as the size of a photograph) from metadata that describes an idea (such as the relationship between two buildings).

Figure 20       PFP information structure (below)

The tags shown in Figure 20 come from the GDMS DTD (see section 7).

 

11.1.1. Analysis

The left side of Figure 20 attaches ideas and interpretations to specific parts of the Forum. The <div> tags on the left refer to progressively smaller sections of the Forum, narrowing the physical model down to a specific wall. A <divdesc> tag then contains metadata describing that wall, and a group of resources points to a picture of the wall and a relevant article. This half of the information structure includes relationships between objects and arguable interpretations of the objects (such as time period, purpose, and use). The metadata in this section is about the content of the resources, not the resources themselves: this particular section describes a particular wall, not the picture of the wall.
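A skeletal sketch of this half of the structure, using the GDMS tags named here (the type attribute and its values, and the exact nesting, are illustrative only; the actual markup is governed by the GDMS DTD in section 14.2):

<div type="forum">
  <div type="building">
    <div type="wall">
      <divdesc>
        <!-- interpretive metadata about the wall itself: period, purpose, use -->
      </divdesc>
      <!-- group of resource pointers: the picture of the wall and a relevant article -->
    </div>
  </div>
</div>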

 

11.1.2. Digital Resources

The right side of Figure 20 shows the structure for the digital resources. The far right side shows an entry in the image catalog: a <res> tag holds metadata describing a photograph (size, type, photographer, etc.), any relevant copyright and ownership information, and a pointer to the image file. Similarly, an observation catalog holds notes and articles (although in this case the textual content is already in the XML file, rather than in a separate file somewhere else in FEDORA). The metadata in this section is about resource containers, not about what is in the containers: this section describes a particular picture of a particular wall, but it does not describe the wall itself.
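A correspondingly skeletal sketch of an image catalog entry on this side (the id and file attribute names and their values are invented; only the <res> tag and the kinds of metadata listed above come from the structure):

<res id="macellum-photo-01" file="macellum-photo-01.jpg">
  <!-- descriptive metadata about the photograph: size, type, photographer -->
  <!-- copyright and ownership information -->
</res>

An observation catalog entry would look similar, except that the note or article text would sit inside the XML itself rather than in a separate file.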

 

12. Policy committee report

The primary charge of the policy committee is to recommend policy for libraries collecting digital works.[†] The committee began its deliberations with an explicit distinction between traditional and digital collecting, based on the widely held assumption that technical differences between analog and digital publications require changes in library methods and policy in order to fulfill traditional library objectives and responsibilities.

We first looked at the activities and objectives associated with traditional library collecting (discovery, selection, acquisition, preservation, description, access, control, and deselecting) and asked if any or all of them are relevant in digital collecting and if there are new activities that need to be added. The preliminary conclusion is that all of the traditional activities and objectives remain relevant. Indeed, much of the methodology for discovery and deselection is only slightly changed (e.g., the discovery process should include internet-accessible resources).

Major differences can be found, however, in the techniques employed to administer or manage digital works, their computer representations (in formats such as XML, JPEG, TIFF, MPEG, and the like), and the file or files that embody and fix the works. We have no firm conclusions yet, but we believe that selection, acquisition, description, access, and control of digital works will require substantially different policies and methods from those that govern the same activities with analog works.

 

12.1. Current issues

We are currently examining some of the more difficult administrative issues. These include physical and bibliographic control, work identification and integrity, authenticity, persistence, versioning, and copyright. The remainder of this section will discuss these issues and the key questions we are asking.

 

12.1.1. Control and collecting

Traditional collecting involves assuming specific obligations towards works collected and the responsibility to reliably fulfill these obligations. The core obligations are ongoing preservation and access to the works collected. Our starting assumption is that the Library will have essentially the same obligation and responsibility for digital collections.

The first and essential requirement of digital collecting is asserting physical control over the file(s) comprising works. Without physical control, the library cannot realistically and responsibly assume any obligations, especially obligations to provide ongoing preservation and access. Simply linking to resources controlled by others thus does not constitute collecting and cannot incur any of the obligations associated with collecting. Such physical control may, however, be delegated to a trusted outside agent.

 

12.1.2. Work identification and integrity

How do we identify works? Identifying works in traditional media has long been recognized as an intractable philosophical problem. The library literature abounds with articles and books exploring the challenging epistemological problems that can be raised in this area. The library profession is ultimately required to have practical solutions, however, and so the library community has identified relatively stable characteristics of traditional media on which they have established criteria and methods for identifying and controlling works.

First and foremost, traditional media come in discrete physical forms that present clear boundaries: books, journals, issues of journals, CDs, and the like. Taken together with the title page (or title pages in the case of multipart monographs and serials), the physical boundaries of traditional publications provide a reasonably recognizable unitary identity. Libraries generally create a bibliographic surrogate (or record) for each discrete work so identified. Discrete works are generally recognized at the “macro” level. Articles, chapters, individual poems, illustrations, etc., contained in the works are not described in the surrogate. The rationale behind this “macro” approach is largely economic, as detailed analysis of the contents of works is generally beyond the means of the library community. To compensate for this approach, libraries frequently subscribe to abstracting and indexing services that provide detailed analysis of works, and thus complement the bibliographic descriptions provided by libraries. While librarians recognize that this is an imperfect approach, it has been proven to be a workable methodology that provides acceptable access, description, and control.

Digital publications have so far proven less susceptible to practical methods like those used with traditional publications. Digital publications do not have explicit physical characteristics (with the exception of discretely packaged objects on portable storage media such as CD-ROMs). A work can be made up of more than one file, a file can contain more than one work, and multiple works can straddle multiple files. A digital publication may not even have an easily identified starting point.

Even identifying the principal work that a publication embodies may not be sufficient, since a given digital publication may be a collection of works. Catalogers have traditionally cataloged at the "book-level" or "serial-level" without providing detailed analysis of content since the library community established practical limits on the depth of analysis of traditional media, but digital publications lack the clear boundaries and commonplace internal characteristics of traditional media that made establishing such limits practical. It may be difficult or impossible to determine, for example, whether a given text is intended to be an independent work or simply a dependent part of a larger whole.[‡] Digital publications have yet to establish predictable and widely adopted conventions that could serve as a foundation on which to establish consistent and affordable bibliographic description and control.

In the absence of such digital publication conventions, it will be necessary for the Library to impose obligations on the creators, such as supplying information to assist librarians in identifying the contents of digital publications. In particular, it will be necessary for creators to explicitly identify and provide information about the "starting point" of digital publications, and information identifying the principal work and (if necessary) subworks represented in the publication.

 

12.2. Other issues

12.2.1. Copyright

This is a large and complex issue that the committee intends to take up in more detail soon. But some of the questions that we are starting to formulate include:

·        Is all or part of the digital work explicitly protected by the creator’s copyright?

·        Is all or part of the digital work explicitly protected by copyrights of someone besides the creator?

·        Does the creator have digital rights?

·        Does the creator have exclusive rights?

·        What are the library’s responsibilities for tracking and enforcing copyright?

 

12.2.2. Authenticity

When a library collects a book or journal, it is generally most interested in the information content, the words on the pages. The physical book itself is generally not an object of interest (the exception being rare and special materials). A book is still considered authentic if the library replaces its softcover binding with a hardcover binding. A digital work's authenticity may, however, be inextricably intertwined with the software that indexes and renders it, and with the specific aesthetics of that rendering. A carefully written stylesheet, for example, gives a very specific look and feel to data and can demonstrate scholarly analysis and criticism. Changing this look and feel may destroy the authenticity of the creation. The work as a whole, its contents, and the rendering of those contents can all have intrinsic value. This leads to several difficult issues.

·        Does the work as a whole have intrinsic value? That is, is there value in both the data (information content) and the data behaviors (look and feel)?

·        When the content and rendering are both considered to have value, are there circumstances under which it would be acceptable to collect only the content, and others under which it would not?

·        Or, can the data be collected independently of the data behaviors and can new library-determined behaviors be associated with it?

·        Does the creator, publisher, or library have the right to make this determination? Should it be negotiated among all three?

 

12.2.3. Persistence

Maintaining the persistence of the content and behavior of digital works, and the persistence of references and links among works both inside and outside the collection, presents many technical and policy challenges. The committee is operating under the assumption that if the content and the associated behaviors and relations of digital works are maintained in standard forms, then preserving them over an extended time will be manageable. References between works within the collection and works outside of it (and therefore outside of the library's control) present a particularly difficult challenge, because of the lack of control over one or the other end of a relation and the lack of a global solution to the problem of persistent identifiers and addresses. Among other questions, the committee intends to address the following issues:

·        Are the data and/or the data behaviors comprising the work based on open, public standards?

·        Are all files comprising the work controlled and transferable by the creator?

·        Are the links between resources comprising the work based on standards? Are the links embedded in the data representation? Or are the links maintained separately?

·        If links to a work-in-progress exist in other collected works, is the library responsible for maintaining versions in order to maintain the link’s integrity? What about links to works that are not collected and therefore not under the library’s control?

·        If a work is collected in successive editions and is linked by other collected works, is the library responsible for persistent identification of these works (or subworks and sub-addresses) in other collected projects? What about other uncollected projects that link to it?

·        If successive editions of a work are collected, are the editions interrelated? If successive editions of a work that has subworks (i.e., editions of subworks) are collected, are those subworks interrelated?

 

12.2.4. Editions/versions

Even works that are stable and relatively constant present many difficult administrative challenges. But not all works are stable, and many digital works contain databases undergoing constant revision. Large text and image collections are built over a period of many years, but are sufficiently useful to be published at certain points during development. Already complex challenges are further complicated when works undergo regular or relatively frequent revision. Among the questions to be addressed by the committee in this area are the following:

·        Is the work complete and stable or is it undergoing regular or frequent revision?

·        If undergoing revision, do users need real-time access to new material, as it is authored?

·        If undergoing revision, will the production methods support controlled publication of stable editions or versions?

 

12.2.5. Bibliographic, administrative, and structural control

Digital control involves control of works (intellectual description, access, and rights), the files embodying works, and relations between files that are essential for reproducing works on demand. These areas of control have been identified as bibliographic, administrative, and structural data. Each requires data above and beyond the work’s content (i.e., metadata).

·        Is the library responsible for providing descriptive cataloging of all works collected?

·        Should the creator provide descriptive data? If so, what descriptive standards and practices will be required, and how will creators be trained to supply this data?

·        Who is responsible for creating, identifying, and maintaining administrative data such as file inventories, creation and standards information (context of creation), and copyright information?

·        Who is responsible for creating structural data, such as the data needed to reproduce a "book" from multiple page image files?

 

13. Summary of progress this year

In the past year we moved the Rossetti project, a highly structured and very extensive scholarly work, into a sustainable form in a library system — a first, we believe, for any born-digital scholarly publication of this size and complexity. We have (in this report) thoroughly documented that process and also documented, in similar detail, last year’s importation into FEDORA of more than a thousand images and related texts from the Salisbury project. We have also made considerable progress in planning and designing a strategy for collecting The Pompeii Forum Project, which we will carry out in the coming year. If SDS were to produce only this, it would still have shown that it is possible to collect and preserve digital scholarly publications that originate and develop outside of the library’s control, provided that these projects are standards-based (as much as they can be), that one has access to the underlying data and can modify it as necessary, and that one has the time, expertise, and cooperation with the author to analyze the problems collection will raise and plan a strategy for resolving them.  This sounds expensive and it is, compared to collecting a book or a journal, but even in this initial exploratory project the incremental cost of bringing existing scholarly projects into some substantially collectible form has been a small fraction of the cost of producing the publication in the first place. Given that, and given the fact that at this point the world contains a limited number of historically and intellectually important publications of this sort, the odds of being able to collect those publications and being able to produce a more widely accepted and well understood set of standard practices for future authors seem quite good.

One of the more important, if obvious, points to emerge from the experiences of the past year is the need for unambiguous encoding. If a work relies on local knowledge to function properly or if too much of its structure is implicit or requires human intelligence to infer, it will be exceedingly difficult for a library or archive to collect it. In our experience, the kind of ambiguity encountered in the Rossetti workcodes almost inevitably creeps into file-naming and reference schemes over the course of digital scholarly projects, and once it has developed it can be difficult to correct. Both the Rossetti and the Salisbury projects were created at IATH with the best design and implementation we could provide at the time, but we still needed to take substantial steps to disambiguate references within the collection, as well as other more predictable steps such as normalizing data, checking for missing files, analyzing possible problems with moving from SGML into XML, and so forth.

Clear and consistent documentation of all sorts — from file management to house style to editorial theory — also significantly improves a born-digital scholarly publication’s chance of being collected and preserved by libraries. Without such documentation, it will be difficult to determine what level of collection is reasonable, and how much work will be required to accomplish it. Moreover, since any project that develops outside of a library or library-supported workspace will require at least some transformation before it can be collected, good documentation can also increase the chance of automating some or all of that transformation, making collection and preservation less expensive and therefore more likely.

Finally, we have already made some progress (and intend to focus much of our energy in the coming year) on developing a library-friendly workspace for authors of digital scholarly content. The tools in this workspace will support agreed-upon standards and the environment will promote best practices. This will help scholars produce their own digital work in a library-friendly environment, making it much easier for libraries to collect and preserve their work. We expect that WebCollector, Granby, and the GDMS toolkit will be included in the workspace, and we plan on using The Pompeii Forum Project as a test case for the workspace. We will also select at least one other project that is still in an early stage of development, to see how the workspace can affect a project that doesn’t have legacy data to retrofit or established practices to unlearn. 


[*] For all practical purposes the archive uses the heuristic distinction between "work" and "document" that textual theorists such as G. Thomas Tanselle have elaborated during the past thirty years. A work is the idea that informs the creation or proliferation of documents. A document is any of the physical manifestations of that idea. For example, when considered as a work, "The Blessed Damozel" is not represented by any physical document but is instead the informing idea that connects a set of texts and images. A sketch of the main Damozel figure, a sketched background figure, a poem that narrates the event of the picture, the finished oil painting, and the revised poem are all separate documentary instances related to the work.

[†] In this section, work is defined as an independent, identifiable intellectual creation. Independent means that the work is coherent and useful without contextual information. Identifiable means that it has recognizable boundaries and, typically, an explicit name.

[‡] The word intended is used with care, since this requires a certain degree of judgement. An author may view a specific unit of information as a component, but a reader or cataloger may consider it a work. Such judgements necessarily require an intimate knowledge of the entire work as well as its parts. The cataloger cannot be required or expected to have such knowledge.