David Seaman, Electronic Text Center, University of Virginia
The volumes in the University of Virginia's on-line collection of electronic texts are tagged using Standard Generalized Markup Language (SGML), a system for describing structural divisions in text (title-page, chapter, scene, stanza, etc.), typographical elements (changes in typeface, special characters, etc.), and other textual features (grammatical structure, location of illustrations, variant forms, etc.).
SGML tags consist of ASCII data only; they are not proprietary to a particular computer program. This sets them apart from -- say -- the codes in a WordPerfect document, which belong to, and are meaningful only within, the WordPerfect program. And while a WordPerfect code defines something by its visual appearance -- a word is italicized -- SGML is designed to describe the class of information to which the phrase belongs. Italics can be used for a variety of purposes, and most SGML tagsets can clearly distinguish an emphatic word from a book's title or a chapter heading.
By recording the structure of a text, such tags allow one to use an SGML search program to constrain searches to particular elements: one cannot limit a search to a single chapter in a novel if there are no markers in the text for chapter divisions; one cannot view a quotation from a play in the context of a scene if the scenes are not delimited.
A chapter whose title should appear in italics could be tagged like this:
<div1 type="Chapter" n="1">
<head rend="italics"> Chapter Name </head>
<p>[The text of the chapter goes here]</p>
</div1>
Features to notice
- The chapter is enclosed within a pair of "major division" --
<div1> -- tags. As with all tag pairs, the closing tag is distinguished
from the opening one by the inclusion of a forward slash mark --
</div1>. In this instance, the opening <div1> tag also contains
information about the type and number of the division.
- <head> tags enclose the chapter's title; the opening tag
contains the additional information that the title should be
rendered in italics.
- The text of the chapter will be contained in tags such as those that mark off paragraphs -- <p> </p>.
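As a rough illustration of how structural tags allow a search to be limited to one element, here is a minimal sketch in Python using the standard library's XML parser. It is not the search software used by the Center; the sample text, the enclosing <text> element, and the function name search_chapter are all hypothetical, and the markup is assumed to be well-formed XML in the style of the chapter example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text, tagged in the style of the chapter example above.
SAMPLE = """
<text>
  <div1 type="Chapter" n="1">
    <head rend="italics">The Arrival</head>
    <p>The whale was seen at dawn.</p>
  </div1>
  <div1 type="Chapter" n="2">
    <head rend="italics">The Departure</head>
    <p>No whale was seen that day.</p>
    <p>The ship sailed at dusk.</p>
  </div1>
</text>
"""


def search_chapter(xml_text, chapter_n, word):
    """Return the paragraphs in one chapter that contain `word` (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    hits = []
    # The type and n attributes behave like database fields that narrow the search.
    for div in root.iter("div1"):
        if div.get("type") == "Chapter" and div.get("n") == str(chapter_n):
            for p in div.iter("p"):
                text = "".join(p.itertext())
                if word.lower() in text.lower():
                    hits.append(text)
    return hits


# Only chapter 2 is searched, so the match in chapter 1 is ignored.
print(search_chapter(SAMPLE, 2, "whale"))
```

Without the <div1> markers there would be nothing for the function to select on, which is the point made earlier about chapter divisions.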
The tags and procedures used by the Electronic Text Center are part of the Text Encoding Initiative (TEI), an implementation of SGML for humanities texts. We follow a sophisticated and well-chosen subset of the TEI tags called TEILITE and have recently been migrating our materials over to the TEIXLITE standard, produced by Michael Sperberg-McQueen, Lou Burnard, and the TEI Consortium (the encoding examples now reflect these XML differences).
The Aims of the Center
Our goal at the Electronic Text Center is to offer a wide range of accurate electronic texts for a broadly conceived humanities community at Virginia and beyond. The texts we acquire and create are marked up with SGML and become part of our on-line archive; whenever legally possible, we give public access to our on-line texts through the World Wide Web, for non-commercial uses. The Web-accessible texts go through a "TEI-to-HTML" converter when the user asks to see them; that is, the conversion occurs "on the fly". Please see the Conditions of Use statement before downloading any texts.
Considerable attention is given to the accuracy and completeness of these texts, and to accurate bibliographical descriptions of them. Book illustrations and other supporting visual material (manuscript pages from Special Collections, for example) are included whenever we can. Such practices are essential to the building of a long-term textual resource, and academic and general readers alike require accurate and attractive texts. Equally important is the effort we put into building a user community and into providing training, documentation, and support appropriate to our patrons.
The software we use
SGML texts are not, of course, designed to be read "in the raw". Ideally, one uses them through software tools that interpret the tags as database "fields" while searching and as a set of typographical layout instructions while displaying the results.
The software we currently use to index and search our databases is PAT 5.0, the OpenText search engine [no longer available], a tool originally designed for use with the Oxford English Dictionary.
The search engine is accessed through a Web interface built in the UVa Library, which makes use of a TEI-to-HTML converter, written at the Electronic Text Center, that converts the TEI text to HTML "on the fly". This eliminates the need to keep HTML copies of our TEI texts on the server. The texts can be marked up once, in a tagset that suits the material being marked, and Web access is still possible. The texts in the Middle English and Modern English sections of our on-line holdings are good examples of the TEI-to-HTML process in action.
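To make the idea of "on the fly" conversion concrete, the following is a minimal sketch in Python, not the Center's actual converter. The function name tei_to_html and the choice of HTML elements (<div>, <h2>, <i>) are assumptions made for illustration; the sketch handles only the <div1>, <head>, and <p> tags from the chapter example and assumes well-formed XML input.

```python
import xml.etree.ElementTree as ET
from html import escape

# The chapter example from earlier in this document, used as input.
TEI = (
    '<div1 type="Chapter" n="1">'
    '<head rend="italics"> Chapter Name </head>'
    '<p>[The text of the chapter goes here]</p>'
    '</div1>'
)


def tei_to_html(xml_text):
    """Convert a small TEI fragment to HTML (hypothetical, illustrative mapping)."""
    root = ET.fromstring(xml_text)
    out = []
    for div in root.iter("div1"):
        # The division type becomes an HTML class name.
        css_class = escape(div.get("type", "division").lower(), quote=True)
        out.append('<div class="%s">' % css_class)
        for child in div:
            text = escape("".join(child.itertext()).strip())
            if child.tag == "head":
                # The descriptive rend attribute is turned into a display instruction.
                if child.get("rend") == "italics":
                    text = "<i>%s</i>" % text
                out.append("<h2>%s</h2>" % text)
            elif child.tag == "p":
                out.append("<p>%s</p>" % text)
        out.append("</div>")
    return "\n".join(out)


print(tei_to_html(TEI))
```

Because the HTML is generated at request time from the single TEI source, a change to the mapping affects every text at once and no second copy needs to be stored.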