Guidelines for SGML Text Mark-up at the Electronic Text Center
David Seaman, Electronic Text Center, University of Virginia
ETEXT Center Guidelines for the Creation of
Etext staff: if in doubt see David Seaman for guidance on the settings at which
to scan an image.
We have eight years of experience in the creation of digital
copies of book illustrations, typescript, and manuscript, so don't try to
"go it alone". The .tiff files may be sizeable -- don't be put off, and
especially don't be tempted to scan at too low a resolution
(or, God forbid, at 8-bit colour) just because a tiff is a big file.
We don't want to have to re-scan at a later date.
The tiffs go off onto a CD as soon as we have made
jpeg and gif versions for current everyday use.
Rules and Regulations of Image Scanning and
The following list explains the items we typically scan, their
specifications for scanning, and how to name them for the
electronic text database at the Etext Center:
- What Typically Warrants Scanning
- Images of the spine, front cover, end-papers (ONLY if visually
interesting), frontispiece (if there is one), and title-page.
- All other images in the text, or anything else of
visual interest, including ornamental capitalizations and small images
embedded in the text itself.
- Scanning Specifications
- At the Etext Center, when we say an "image" we mean the entire page
upon which the image is placed even if it is something as small as an
ornamental capitalization. When you draw your box around the image that
you want to scan, leave a few millimeters on each side of the page so
the viewer can better appreciate the three-dimensionality of the book as
a physical object.
- All images are scanned and saved as 400 dpi (dots per inch), 24-bit
color tiffs. See Special Collections Image Scanning for more information.
- Image Naming Conventions
- Again, all images will be saved in uncompressed tif format.
- An image name can have no more than 8 characters as some of our
work is done in the MS-DOS environment. These characters
can only be numbers and letters--no punctuation.
- At Etext, we typically name images so that they will correspond to
the texts they are a part of.
Example: if you are tagging the frontispiece, the titlepage, and an
illustration on page 122 in Booth Tarkington's The
Flirt (a work whose UVa ID is TarFlir) you would name these
images as follows:
Page 122: "TarFl122"
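The naming rule above can be sketched as a small helper. This is only an illustration in Python, not a tool the Center provides: the function name image_name and its parameters are hypothetical, and it assumes the suffix is a page number (or another short alphanumeric label) appended to as much of the work's ID as will fit in eight characters.

```python
def image_name(work_id, suffix, max_len=8):
    """Build an image name: as much of the work ID as fits, then the suffix.

    Hypothetical helper; max_len=8 reflects the MS-DOS limit described above.
    """
    suffix = str(suffix)
    for part in (work_id, suffix):
        # Only unaccented letters and digits are allowed: no punctuation.
        if not (part.isascii() and part.isalnum()):
            raise ValueError(f"letters and numbers only: {part!r}")
    room = max_len - len(suffix)
    if room < 1:
        raise ValueError("suffix leaves no room for the work ID")
    return work_id[:room] + suffix


print(image_name("TarFlir", 122))  # TarFl122
```

Truncating the ID rather than the page number keeps the page reference intact, which matches the "TarFl122" example.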
Illustration and Image Tags
The following tags are used to tag illustrations and information that
goes with illustrations.
<figure> </figure>
The <figure> tag pair indicates the location of a graphic,
illustration, or figure. The filename for the digital image is
given as the value of an entity= attribute.
"Entity" specifies the file in which the graphic
image of the figure is stored. Do not include a suffix denoting
the image type (e.g. FILENAME.gif). Usually, we will name the
image file using as much of the work's unique ID as possible,
and the page number on which the illustration occurs.
As some of our work is done in the MS-DOS environment, the image
filename should not be longer than eight characters.
So, for an illustration from Booth Tarkington's The Flirt (a
work whose UVa ID is TarFlir) the entity value for an
illustration on page 122 would read as follows:
<figure entity="TarFl122">
The <head> tag may be used to transcribe (or supply) a
heading or title for the graphic itself:
<head> "Kiss me some more darl----"</head>
The <figDesc> tag is important. It contains a brief prose description of the
appearance or content of a graphic figure. It is necessary because the
information in this tag allows the user to search
for information within a particular illustration.
<head>"Kiss me some more darl----"</head>
<figDesc>Grayscale illustration of a young girl trying to kiss a
boy, under moonlight. </figDesc>
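As a rough sketch of how the pieces described above fit together, the following Python helper assembles a <figure> element from an entity value, an optional heading, and a description. The function name figure_markup is hypothetical, and the sketch assumes the heading and description are already valid SGML text (it does no escaping).

```python
def figure_markup(entity, head="", fig_desc=""):
    """Return a <figure> element as text (illustrative only).

    entity   -- image filename without a suffix, e.g. "TarFl122"
    head     -- optional caption transcribed or supplied for the graphic
    fig_desc -- brief prose description of the image
    """
    if "." in entity:
        # The guidelines say not to include a suffix such as .gif.
        raise ValueError("entity must not include a file suffix")
    lines = [f'<figure entity="{entity}">']
    if head:
        lines.append(f"<head>{head}</head>")
    if fig_desc:
        lines.append(f"<figDesc>{fig_desc}</figDesc>")
    lines.append("</figure>")
    return "\n".join(lines)


print(figure_markup(
    "TarFl122",
    head='"Kiss me some more darl----"',
    fig_desc="Grayscale illustration of a young girl trying to kiss a "
             "boy, under moonlight.",
))
```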
- Note: if it is possible to use terms from the following controlled
vocabulary, that would be to our advantage: the
Thesaurus for Graphic Materials I: Subject Terms, consisting of
5,997 terms and numerous cross references for indexing visual materials.
It is a companion document to
Thesaurus for Graphic Materials II: Genre and
Physical Characteristic Terms.
- You may also have one or more paragraphs following
the <head> and preceding the <figDesc> to transcribe
any additional text relating to the figure found in the print source.
The <head> and <figDesc> fields are valuable sets of information
for PAT searches -- as the set of etext images grows, they will allow
a user to search image captions and descriptions of those images.
For a WWW user coming to the data through a VT100 client such as Lynx,
the <figDesc> field should be able to be sent as an alternative to the
image.
Other Simple Examples
An engraved portrait of Dorothea
posed thoughtfully at a writing table. Three stacked books
stand in the right foreground. Dorothea's right hand holds a
<head>Mr. Casaubon and Dorothea</head>
<figDesc>An engraving by W.L. Taylor showing Mr.
Casaubon and Dorothea, presumably in their "hour's
<hi>tête-à-tête</hi>." Casaubon sits in an
upholstered wooden chair in the left background corner, facing
the viewer, with Dorothea's right hand in his own. Dorothea
sits on a footstool at center-right, turned towards Casaubon.
The left quarter of her face is visible to the viewer. The
setting is a sunny room with one curtained window and one
uncurtained, open window behind the figures.</figDesc>
SGML Text Embedded in Image Files
A growing number of our electronic texts have book illustrations
and other book-related images along with the tagged ASCII text. To
include an attribution record in these book illustrations we bury
a version of the TEI header into the binary code of the image.
The user who saves an image from a text on our etext server now gets --
in Trojan Horse fashion -- a tagged full-text record of the creation
of that image as part of the single image file they save. The image
header and related <figDesc> information gives us a searchable SGML text
database for our images.
For a description of an early implementation of "text in images", see
"Campus Publishing in Standardized Electronic Formats: HTML and TEI."
in Scholarly Publishing on the Electronic Networks,
Specific Procedures for Adding Image Headers
Image Processing on Unix: ImageMagick
The mogrify part of this impressive Unix tool allows us to perform
batch image conversions from one format to another (e.g., TIFF to JPEG)
and to add tagged text headers into the images as we convert.
ImageMagick is available for download and is installed on the UVa etext
machines. See the ImageMagick
README file for more information.
For an interactive on-line implementation of ImageMagick, see the Image
Step by Step Instructions for UVa Etext processors
To add a header to an image:
1. Use the new TEI header template in etext/Done; it has several new fields:
a) Just after the "Creation of machine-readable version:" field, there are two lines to
indicate who created the digital images.
b) The first note field should be used to indicate the existence of images; also note if the
images come from a different source than the print text.
c) In the <editorialDecl> section, there's now a standard indication about how we store
the images.
d) There's an extra <textClass> section which includes keywords and terms to indicate
the artist, the type of visual work, and the type and dpi of the digital image; modify those
fields as appropriate (e.g., if you have a 24-bit color image at 400 dpi, that's the only
information that should appear in that field).
2. Make a copy of the completed TEI header for the text in question,
and put a hash mark and a space at the beginning of every line:
# <title>blah [a machine-readable transcription]</title>
The hash marks are necessary for some image viewers. This text is
now ready to go into the image(s).
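Adding the hash marks by hand is tedious for a long header. A minimal Python sketch of the same step follows; the function name hash_prefix is hypothetical, and the sketch assumes the header is ordinary line-oriented text.

```python
def hash_prefix(header_text):
    """Put a hash mark and a space at the beginning of every line."""
    return "".join("# " + line + "\n" for line in header_text.splitlines())


sample = "<title>blah [a machine-readable transcription]</title>\n"
print(hash_prefix(sample), end="")
# # <title>blah [a machine-readable transcription]</title>
```

To process a file, read the copied header, pass its text through hash_prefix, and write the result to the header file you will embed (AutWork.header in the example used in these instructions).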
3. You can now simultaneously convert your tifs to jpgs and add in the
header information above to those jpgs.
If the header text file is called AutWork.header,
and your various tiff files are image1.tif, image2.tif,
image3.tif, and image4.tif, then this is what you do:
- a) Make sure the tiff files are in the same directory in which you
are doing your mogrification.
- b) Type the following command:
mogrify -format jpg -quality 50 -geometry 30% image*.tif
You have now converted all the image*.tif files into
image*.jpg files, and those .jpg files have the textual information from the
header embedded within them; the .tif files have remained unchanged. (You can
view the text in the images by viewing the .jpg files in xv, calling up
the control window, and choosing the "comments" button.)
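The command as printed above does not itself name AutWork.header. The Python sketch below shows one way the conversion and the header text could be combined in a single mogrify call. It assumes that ImageMagick's -comment option is how the text is attached, which is an assumption and not necessarily the exact command used at the Center; the function names are hypothetical. ImageMagick may also interpret percent signs in the comment text as format escapes, so check the result in a viewer.

```python
import glob
import subprocess


def mogrify_command(tiff_files, header_text, quality=50, geometry="30%"):
    """Build the argument list for one batch mogrify run (does not run it)."""
    return [
        "mogrify",
        "-format", "jpg",          # write .jpg copies; the .tif files stay as they are
        "-quality", str(quality),
        "-geometry", geometry,
        "-comment", header_text,   # assumption: header embedded via -comment
        *tiff_files,
    ]


def convert_with_header(header_file="AutWork.header", pattern="image*.tif"):
    """Run mogrify on every file matching pattern in the current directory."""
    with open(header_file, encoding="utf-8") as handle:
        header_text = handle.read()
    files = sorted(glob.glob(pattern))
    subprocess.run(mogrify_command(files, header_text), check=True)
```

Calling convert_with_header() from the directory holding the tiffs corresponds to steps a) and b) above.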
If you want textual information that's specific to one particular
image, you need only do the following:
- 1. Repeat step 1 above.
- 2. Repeat step 2, but add the following into the text after
Fill in the fields with the information appropriate to the individual
image. (These tags will also need the hash mark and space before them.)
- 3. Repeat step 3 above.
Image Processing on the Mac: ADDJFIFcomment
- 1. Move a jpg to the Mac; save it again as a jpg using JPEGView -- this
process will only work with a Mac conformant jpg.
- 2. Once you have a Mac jpg, call up the ADDJFIFcomment application;
type in your text, and select "add"; then select the jpg file to which you
would like to add comments.
- 3. NOTE: if you want comments in a gif as well, follow steps 1 and 2,
and then call that new jpg into JPEGView and save as a gif;
ADDJFIFcomment won't take anything but a Mac conformant jpg.
Alternative, and much less preferable methods, used before
- 1. Call up the image in xv, and save it in PBM (ascii) format; it will
assign either a .ppm or .pgm suffix depending on whether the file is
color or greyscale.
- 2. Issue one of the following commands, depending on the suffix xv assigned:
csplit -f pnum file.pgm 02
csplit -f pnum file.ppm 02
This will result in two output files: pnum00 and pnum01. These two
files are your original file.pgm split into two: the first line
and everything following the first line. We want to insert the header
after the first line in the .ppm or .pgm file.
- 3. Concatenate the header and the two "pnum" files in the
following order, to create a new file (here called "file-2.pgm"):
cat pnum00 text.header pnum01 >file-2.pgm
- 4. Call up file-2.pgm in xv and save back to JPEG, or convert to GIF;
the text remains embedded.
NOTE: The text header must have a pound symbol and a space at the beginning of every line:
# text of header goes here
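The split-and-concatenate procedure above can also be sketched in a few lines of Python. This is an illustration only, with a hypothetical function name; it assumes an ASCII .pgm or .ppm file whose first line is the format identifier, and a header whose lines already begin with a hash mark.

```python
def insert_pnm_comment(pnm_data, header_text):
    """Insert header_text after the first line of an ASCII PGM/PPM file.

    pnm_data    -- the whole image file as bytes
    header_text -- header lines, each already starting with "#"
    """
    first_line, newline, rest = pnm_data.partition(b"\n")
    if not newline:
        raise ValueError("expected at least two lines in the image file")
    lines = header_text.splitlines()
    if any(not line.startswith("#") for line in lines):
        raise ValueError("every header line must begin with a hash mark")
    comment = "".join(line + "\n" for line in lines).encode("utf-8")
    # Same order as: cat pnum00 text.header pnum01
    return first_line + b"\n" + comment + rest


tiny_pgm = b"P2\n2 1\n255\n0 255\n"
print(insert_pnm_comment(tiny_pgm, "# text of header goes here").decode())
```

The result can be written to a new file (such as file-2.pgm) and then opened in xv as in step 4.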