The Basic Text-Processing Procedures at UVa
David Seaman, Electronic Text Center, University of Virginia
What follows is a step-by-step set of guidelines for processing texts at UVa's Electronic Text Center. It largely assumes that the electronic text is derived from a print or manuscript source; to date, this has been the case for the vast majority of the texts we have processed. While the precise details of these procedures are specific to UVa, the general process and assumptions should be easily duplicated elsewhere.
- Assuming a text passes an initial inspection, it will be put in a to-do directory and assigned to a preparer. The to-do directory is a holding place for texts waiting to be processed. Each text preparer works on files within his or her working directory.
- Create the seven-letter abbreviated name that will be the text's unique ID, and add this abbreviation to the id= attribute of the <text> tag. Whenever possible, the ID should consist of three letters of the author's name and four of the title: Jane Austen's Emma has an id of AusEmma, for example.
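The naming rule can be sketched in Python. This is a minimal illustration, not the Center's own tool: the function name make_text_id is hypothetical, and the handling of short names is an assumption, since the guidelines do not say what to do when a name or title is too short or when two IDs clash.

```python
import re


def make_text_id(author_surname: str, title: str) -> str:
    """Build a seven-letter ID: three letters of the author, four of the title.

    Assumption: letters are taken from the start of each string after
    non-alphabetic characters are removed. Short names are not covered by
    the guidelines, so this sketch raises an error instead of guessing.
    """
    author = re.sub(r"[^A-Za-z]", "", author_surname)
    work = re.sub(r"[^A-Za-z]", "", title)
    if len(author) < 3 or len(work) < 4:
        raise ValueError("name or title too short; choose the ID by hand")
    return author[:3].capitalize() + work[:4].capitalize()


print(make_text_id("Austen", "Emma"))  # AusEmma
```

The resulting string still has to be checked by hand against the IDs already in use.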
- Identify the source edition for the electronic text, and obtain a copy of it (use Inter-Library Loan if necessary). This identification may require you to contact the creator of the text, if he or she is known. The printed source is invaluable when checking the electronic document; we don't want to be "correcting" things that look like errors but are actually features of the printed text (British spelling, as an obvious example).
If no source edition is marked in the file, if the text's initial creator cannot be found, and if comparison with copies on the shelves of the Library yields no further information, then we need to decide whether we proceed with this text.
- Go to the TEI header webform template, and fill it in to the degree that it can be completed.
- Check the accuracy of the electronic text. You could, for example, run the Unix spell program to see if there are many words that it does not recognise, and check to see if they look like scanning or typing errors. A very corrupt text may need to be abandoned. Don't assume that a text with tags in it already is reliable in its content even if it is reliable in its markup -- there are a number of texts in our Modern and Middle English sections that came to us with TEI tags in place, but which had hundreds of typographical errors when we processed them.
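Where the Unix spell program is not to hand, the same kind of check can be approximated in Python. This is only a sketch: the function name unknown_words and the tiny word list are hypothetical, and a real check would load a full dictionary file.

```python
import re


def unknown_words(text: str, known_words: set[str]) -> list[str]:
    """Return words not found in known_words, in order of first appearance."""
    seen = set()
    result = []
    # Blank out tags and entity references so they are not reported as words.
    plain = re.sub(r"<[^>]*>|&[A-Za-z0-9]+;", " ", text)
    for word in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", plain):
        key = word.lower()
        if key not in known_words and key not in seen:
            seen.add(key)
            result.append(word)
    return result


sample = "<p>The qnick brown fox jumps ouer the lazy dog.</p>"
known = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
print(unknown_words(sample, known))  # ['qnick', 'ouer']
```

A long list of unknown words that look like scanning errors (n for u, rn for m) suggests a corrupt text; a list of archaic or British spellings does not.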
- Check the structure of the electronic text. Look for any structures that can be searched for and replaced with TEI tags (existing word-processor codes, patterns of spacing and layout, etc.).
If the text contains no markup, look for repetitive patterns that can be replaced with a tag (see Notes on Text Formatting). For example, if five spaces at the beginning of a line always mark a new paragraph, this pattern can be searched for and replaced with </p><p>. Do not leave both the <p> marker and the five spaces in the text -- think of the <p> like a TAB command in a word processor.
If the text contains some existing markup other than TEI, replace it. For example, if italics are marked with a pair of # marks (on and off), these can be searched and replaced using a routine that searches for #, replaces it with <i>, goes to the next #, and replaces that with </i>. If the text is already marked up with SGML tags (rare, at present), they may need to be converted to our subset of TEI. Remember that the Unix search-and-replace utility, sed, works line by line, so a simple substitution will miss an item (such as an italicized phrase) that is not all on one line. A Jove, Emacs, or WordPerfect macro may be your best bet.
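Both replacements described in this step can be sketched in Python, whose regular expressions match across line breaks. This is an illustration rather than the Center's own routine; the function names mark_paragraphs and mark_italics are hypothetical.

```python
import re


def mark_paragraphs(text: str) -> str:
    """Replace an indent of exactly five spaces with </p><p>."""
    # The spaces are consumed by the match, so only the tags remain.
    return re.sub(r"^ {5}(?=\S)", "</p><p>", text, flags=re.MULTILINE)


def mark_italics(text: str) -> str:
    """Turn each #...# pair into <i>...</i>, even across line breaks."""
    # [^#]* also matches newlines, so a phrase may span several lines.
    return re.sub(r"#([^#]*)#", r"<i>\1</i>", text)


raw = "     It was a #dark and\nstormy# night.\n     The rain fell.\n"
print(mark_italics(mark_paragraphs(raw)))
```

The first paragraph ends up with a stray </p> in front of it and the last one lacks a closing tag, so both ends of the text need correcting by hand. It is also worth confirming that the text contains an even number of # marks before running the italics replacement, since one missing mark will mispair every phrase after it.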
- Look for the presence of line-end (and page-end) hyphenation. Whenever possible, unambiguous line-end hyphenation is to be closed up, as it interferes with one's ability to search for the hyphenated word.
Line-end hyphenated words are considered to be unambiguous when they are hyphenated only because they fall at the end of a line. "Elec-|tronic" is unambiguous; "word-|processor" is not, as it might appear as one word, or as a hyphenated phrase, or as two words. If in doubt, leave the line-end hyphenation alone.
If removing unambiguous line-end (or page-end) hyphenation, move the second part of a word up to join its first part on the previous line. During such checking, be alert for missing lines and passages.
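A cautious version of this step can be sketched in Python. The test for "unambiguous" used here is an assumption of the sketch, not the Center's rule: a split word is closed up only if its joined form appears in a supplied word list, and everything else is left alone. The function name close_line_end_hyphens is hypothetical.

```python
import re


def close_line_end_hyphens(text: str, known_words: set[str]) -> str:
    """Join words split by a line-end hyphen when the joined form is known."""

    def join(match: re.Match) -> str:
        first, second, punct = match.group(1), match.group(2), match.group(3)
        if (first + second).lower() in known_words:
            # Move the second half up; the line now breaks after the word.
            return first + second + punct + "\n"
        # If in doubt, leave the line-end hyphenation alone.
        return match.group(0)

    pattern = r"([A-Za-z]+)-\n[ \t]*([A-Za-z]+)([.,;:!?)]*)[ \t]*\n?"
    return re.sub(pattern, join, text)


sample = "an elec-\ntronic text, and a word-\nprocessor here\n"
print(close_line_end_hyphens(sample, {"electronic"}))
```

A word list cannot tell that a form such as "re-|cover" is ambiguous when "recover" is also a word, so the result still needs checking against the printed source.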
- The items that you can search and replace with SGML codes may not include the major text divisions (<div1> <div2> etc), in which case you will have to put these in manually. Remember that the top-level divisions are tagged <div1>, the divisions nested within them <div2>, and so on. See A Practical Introduction to the Tag Set.
- Check for special characters, and convert to SGML entity references (see Special Characters).
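The conversion can be sketched in Python as a lookup table. The entity names below are standard ISO names, but the selection of characters is an illustrative assumption, and the function name to_entities is hypothetical; the full list belongs in Special Characters.

```python
# Illustrative subset only: which characters need converting is an
# assumption of this sketch, not the Center's list.
ENTITY_NAMES = {
    "\u00e9": "eacute",  # e with acute accent
    "\u00e8": "egrave",  # e with grave accent
    "\u00fc": "uuml",    # u with diaeresis
    "\u00e7": "ccedil",  # c with cedilla
    "\u00e6": "aelig",   # ae ligature
    "\u00a3": "pound",   # pound sign
    "\u2014": "mdash",   # em dash
}


def to_entities(text: str) -> tuple[str, set[str]]:
    """Replace mapped characters with SGML entity references.

    Returns the converted text and the set of non-ASCII characters that
    had no mapping, so that they can be looked up by hand.
    """
    unmapped = set()
    out = []
    for ch in text:
        if ch in ENTITY_NAMES:
            out.append(f"&{ENTITY_NAMES[ch]};")
        else:
            if ord(ch) > 127:
                unmapped.add(ch)
            out.append(ch)
    return "".join(out), unmapped


converted, leftover = to_entities("caf\u00e9 \u00a35")
print(converted)  # caf&eacute; &pound;5
```

Reporting the unmapped characters, rather than silently passing them through, makes it harder for an unconverted character to reach the parser unnoticed.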
- Paginate. This not only makes the text easier to navigate and cite from, but it also ensures its relative completeness, at least to the page level (watch out for short pages that may indicate a passage has been left out). If there are no page markers, and nothing to search for (2 blank lines, for example, or a control character), this has to be done manually. We have macros to put in the page markers and to add the numbers in the Etext Center.
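When the file does contain something to search for, the numbering can be sketched in Python. Everything specific here is an assumption: that the marker is a form-feed character, that it precedes the page it opens, that pages are numbered consecutively, and that the page marker takes the form <pb n="...">. The function names are hypothetical, and this is not the Center's macro.

```python
def add_page_breaks(text: str, marker: str = "\f", first_page: int = 1) -> str:
    """Replace each page marker with a numbered <pb> tag.

    Assumption: the first page has no marker of its own, so a tag is
    added at the very start of the text as well.
    """
    pages = text.split(marker)
    return "".join(
        f'<pb n="{first_page + i}">\n{page}' for i, page in enumerate(pages)
    )


def short_pages(text: str, marker: str = "\f", first_page: int = 1,
                min_lines: int = 10) -> list[int]:
    """List page numbers with fewer than min_lines lines, for checking."""
    return [
        first_page + i
        for i, page in enumerate(text.split(marker))
        if len(page.splitlines()) < min_lines
    ]


print(add_page_breaks("one\n\ftwo\n"))
print(short_pages("a\nb\nc\n\fd\n", min_lines=2))  # [2]
```

Pages reported by short_pages are only candidates: chapter endings are legitimately short, so each one should be compared with the printed source.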
- Spellcheck, if practical. If the text is from a source of electronic editions that we know to be generally reliable, a full spellcheck may be unnecessary, given our time constraints; spellcheck a section and read through a section, as a double-check. A huge file may simply be too time-consuming to spellcheck fully. Record what you do in the <teiHeader>.
- Make sure that there is a single space at the end of each line. The space is necessary because the TEI-to-HTML filter does not retain a hard return code, and words therefore run together if there is no blank space at the line end.
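This check is easy to automate; a minimal Python sketch follows. The function name normalize_line_ends is hypothetical, and leaving blank lines empty is an assumption, since the guideline does not mention them.

```python
def normalize_line_ends(text: str) -> str:
    """Give every non-blank line exactly one trailing space."""
    fixed = []
    for line in text.split("\n"):
        if line.strip():
            # Drop any existing trailing whitespace, then add one space.
            fixed.append(line.rstrip() + " ")
        else:
            # Assumption: blank lines stay empty.
            fixed.append("")
    return "\n".join(fixed)


print(repr(normalize_line_ends("one\ntwo  \n\nthree \n")))
```

Running the function a second time changes nothing, so it is safe to apply to a file that has already been partly corrected.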
- Double-check the information in the TEI header.
- Unless the steps above have revealed the text to be irredeemably corrupt, it is now ready for parsing and indexing. Run multidocs to check the form of the tags. To parse, use nsgmls (aliased to the command "parse" on the etext machine).