
XML AND TEI

The Extensible Markup Language (XML) is not a programming language. It is a markup language: a set of codes that describe text in a digital or digitized document. The Hypertext Markup Language (HTML) is another markup language, used to represent text, graphics, and other media in web browsers and to control the appearance of a web page. However, while HTML primarily concerns web pages and their appearance, and is therefore primarily presentational, XML describes the content of text, and is therefore semantic. Unlike HTML, XML does not specify how a document is displayed; display is handled through separate style sheets. Although it is called a language, XML does not perform operations the way a programming language does. It simply wraps, or contains, information in tags. Another crucial difference between the two markup languages is that XML, unlike HTML, has no predefined tags. Tags are created by the user and are not defined in the XML specification.

 

XML is very useful for sharing data and for transferring data among several users or members of an organization. It also facilitates searching, as searchable tags can be used to represent longer strings of text. Because it is readable by both humans and computers, it is used in many research, business, and digital humanities applications. It is also hierarchical, allowing data to be represented in a tree data structure consisting of a root, branches, and leaves at different levels. For instance, a book may be represented by the root node of a tree. Under this root, one node can represent the title page, copyright information, and so on. Also under the root, as a sibling of the title page node, another node can represent the table of contents of the book. Under the table of contents node, individual chapters can be represented. Under the chapter nodes, further nodes can represent the individual subsections of each chapter. In this way XML allows data to be structured hierarchically (see the article by Birnbaum).
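The book hierarchy described above can be sketched in Python with the standard library's xml.etree.ElementTree module. This is only an illustration: the tag names (book, titlePage, toc, chapter, section) and the sample content are invented for this example and do not come from any standard, which is consistent with XML having no predefined tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names and content, chosen only to mirror the book example.
BOOK_XML = """
<book>
  <titlePage>
    <title>An Example Book</title>
    <copyright>2022</copyright>
  </titlePage>
  <toc>
    <chapter n="1">
      <section>Background</section>
      <section>Scope</section>
    </chapter>
    <chapter n="2">
      <section>Methods</section>
    </chapter>
  </toc>
</book>
"""


def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and its descendants, in document order."""
    rows = [(depth, element.tag)]
    for child in element:  # iterating an element yields its direct children
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(BOOK_XML.strip())

# Print the tree with two spaces of indentation per level.
for depth, tag in outline(root):
    print("  " * depth + tag)
```

The root node is book; titlePage and toc are siblings directly beneath it, and each chapter carries its sections as children, so the indentation of the printed outline follows the levels of the tree.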

 

XML elements fall into four main categories of content. An element with element content contains only other elements; a list is an example. An element with text content contains only plain text. Mixed content consists of plain text interleaved with other elements. Finally, an empty element has no text associated with it. Empty elements are used as placeholders, bookmarks, or milestones. Elements can be qualified by attributes, which supply additional or supplementary information about an element (see what-is-xml).
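The four categories can be told apart programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the sample document, its tag names, and the helper content_kind are all hypothetical and serve only to illustrate the distinction.

```python
import xml.etree.ElementTree as ET

# Invented sample: one child of <doc> for each category of content.
SAMPLE = """<doc>
  <list><item>one</item><item>two</item></list>
  <title>Plain text only</title>
  <p>Some <emph>mixed</emph> content.</p>
  <pb n="12"/>
</doc>"""


def content_kind(element):
    """Classify an element as 'element', 'text', 'mixed', or 'empty' content."""
    has_children = len(element) > 0
    # Text may appear before the first child (.text) or after any child (.tail).
    pieces = [element.text] + [child.tail for child in element]
    has_text = any(piece and piece.strip() for piece in pieces)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "text"
    return "empty"


root = ET.fromstring(SAMPLE)
kinds = {child.tag: content_kind(child) for child in root}
print(kinds)

# Attributes qualify an element; here the empty <pb/> element carries n="12".
print(root.find("pb").attrib)
```

Whitespace used only for indentation is ignored by this helper, which is a simplifying assumption: whether such whitespace counts as text depends on the application.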

 

In the digital humanities, XML is used to represent documents. It has two main advantages. First, XML provides a formal model for representing an ordered hierarchy and is therefore well suited to structured documents. Second, although it is readable and interpretable by human users, XML is also machine readable, and algorithms can operate on these ordered hierarchies, or trees, very efficiently, more so than on unstructured or non-hierarchical representations. Consequently, because documents may be modeled as hierarchies, a large amount of text, a large number of documents, or even an entire corpus can be processed efficiently. XML therefore facilitates large-scale processing and analysis.
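As a small illustration of processing several documents at once, the sketch below walks every tree in a tiny, invented corpus and tallies how often each tag occurs. The file names, tag names, and the helper count_tags are hypothetical; a real project would read files from disk and would likely use a richer schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A hypothetical three-document corpus, kept in memory for simplicity.
CORPUS = {
    "letter1.xml": "<letter><p>Dear <name>Ada</name>,</p>"
                   "<p>Greetings from <place>Turin</place>.</p></letter>",
    "letter2.xml": "<letter><p><name>Charles</name> wrote to <name>Ada</name>.</p></letter>",
    "letter3.xml": "<letter><p>No names here.</p></letter>",
}


def count_tags(corpus):
    """Count every element tag across all documents in the corpus."""
    totals = Counter()
    for xml_text in corpus.values():
        root = ET.fromstring(xml_text)
        # iter() visits the root and all of its descendants in document order.
        totals.update(element.tag for element in root.iter())
    return totals


totals = count_tags(CORPUS)
print(totals)

# The same traversal can target one tag, e.g. every tagged name in the corpus.
names = [el.text for text in CORPUS.values() for el in ET.fromstring(text).iter("name")]
print(names)
```

Because each document is a tree, the same few lines apply unchanged whether the corpus holds three documents or many thousands.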

 

The motivation and concepts of the Text Encoding Initiative (TEI) were described in a previous section. TEI is technically a “community of practice”, where a community of practice is “a group of people who share a common concern, a set of problems, or an interest in a topic and who come together to fulfill both individual and group goals”. The TEI Guidelines for Electronic Text Encoding and Interchange define the specification for this markup language, which is a type of XML format. The TEI Guidelines document approximately 500 TEI elements, including a word (TEI element word, with tag <w>), a sentence (TEI element s-unit, with tag <s>), and a person (TEI element person, with tag <person>). Many large projects in the digital humanities use TEI, including The Walt Whitman Archive, The Rossetti Archive, The World of Dante, and The Encyclopedia of Chicago. A partial list of large-scale projects can be found on the TEI website. A number of software tools are available for creating, editing, processing, and publishing TEI documents, including web-based systems. A Python TEI reader and other Python TEI programs can be obtained and modified. Tools for importing and parsing TEI documents for further analysis are also available in R.
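To give a feel for how the <s> and <w> elements are used, the following sketch parses a small TEI-style fragment with Python's standard xml.etree.ElementTree module rather than a dedicated TEI reader. The fragment is invented for illustration and is not a complete TEI document (it omits the header that a full TEI file contains). TEI elements live in the namespace http://www.tei-c.org/ns/1.0, so queries must name that namespace; the prefix tei used below is an arbitrary local choice.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; "tei" is just a local prefix for use in queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment: two sentences (<s>), each split into words (<w>).
FRAGMENT = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>
        <s><w>The</w> <w>archive</w> <w>opens</w></s>
        <s><w>Readers</w> <w>arrive</w></s>
      </p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(FRAGMENT)

# Collect the words of each sentence, preserving document order.
sentences = [
    [w.text for w in s.findall("tei:w", TEI_NS)]
    for s in root.iterfind(".//tei:s", TEI_NS)
]

for number, words in enumerate(sentences, start=1):
    print(number, " ".join(words))
```

Once the words and sentences are available as ordinary Python lists, they can be passed to whatever counting or analysis step a project requires.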


License


Contemporary Digital Humanities Copyright © 2022 by Mark P. Wachowiak is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, except where otherwise noted.