About XML

Revision as of 12:46, 25 April 2017 by PBO (Talk | contribs)

XML at Huygens ING

Huygens ING uses XML to create many of its scholarly editions. XML is a plain text format that uses tags to delimit and name structural elements within the text and to provide additional information about these elements. XML was defined by the W3C. For scholarly editing (and many other purposes) Guidelines for how to use XML were developed by the Text Encoding Initiative (TEI). Huygens ING follows the TEI Guidelines whenever possible.

This wiki documents the choices that we make in using the TEI Guidelines, which occasionally describe multiple ways of encoding a single phenomenon. Sometimes, the documents that we edit have special requirements that TEI does not yet support. The general pages describe the project-independent choices that we made; the project-level pages describe project-specific additions and modifications. Often, these additions relate to secondary, editorial material rather than to original material, as the secondary material tends to be more project-specific than the primary material.


XML Basics

We clearly cannot provide a full introduction to XML here. Online training is available at many locations, e.g. at W3schools. The TEI also maintains a page with information for learning TEI. You can also attend formal training, e.g. at the Oxford summer school in Digital Humanities.

Still, for those getting started with XML without the patience for more formal training, here are a few XML basics:

  • <starting tag>...</ending tag> (note the position of the / ), e.g.
<address>Prins Willem-Alexanderhof 5</address>
  • <tag without content/>, e.g.
<lb/> linebreak
  • An element can have an attribute. An attribute has a value, written between quotation marks (" "):
 	<hi rend="super">st</hi>
	hi 	is the element (highlight)
	rend 	is the attribute
	super 	is the value of the attribute.
  • @xml:id is an attribute that provides a unique identifier for an element. To point to that element, use "#" followed by the identifier.
	E.g. a page surface may be defined as: <surface n="1r" xml:id="s1r">
	Later in the document we refer to the xml:id "s1r" using the pointer "#s1r" on the facs attribute of the page break element (pb):
	<pb f="1r" n="1" facs="#s1r"/>
  • In an XML document, physical linebreaks ('enters'), even multiple ones, don't matter.

If you want a linebreak in a text you have to encode one using <lb/>.

The line is too			→	The line is too short
short				

The line <lb/> is too short	→ 	The line
					is too short.

(The arrow (→) shows what an encoded text will look like when published)
  • Whitespace next to a tag does matter, however: XML documents are whitespace sensitive (whitespace includes spaces, tabs and newlines):
normal<lb/>ly		→ 	normally
normal <lb/> ly 	→ 	normal  ly
normal<lb/>
ly 			→ 	normal  ly
  • A comment (a note for yourself or for other transcribers) is encoded in this way:
<!--  This is a comment  -->
<!-- This is a strange sign that I cannot read, it's better to leave it now and come back to it in a week or two -->
<!-- This encoding is for hyphenation -->
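The basics above can also be checked with a parser. The following is a minimal sketch in Python (standard library only), not part of the Huygens ING workflow. The wrapping <text> and <p> elements and the helper name text_around_lb are made up for illustration. It reads an attribute value, follows a "#" pointer back to the element carrying the matching xml:id, and shows that the parser keeps whitespace next to <lb/> exactly as typed. How that whitespace appears in the published text depends on the publication stylesheet, which is not shown here.

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Made-up fragment that reuses the examples from the list above.
SAMPLE = (
    '<text>'
    '<surface n="1r" xml:id="s1r"/>'
    '<pb f="1r" n="1" facs="#s1r"/>'
    '<p>1<hi rend="super">st</hi></p>'
    '<!-- This is a comment -->'
    '</text>'
)

root = ET.fromstring(SAMPLE)  # comments are dropped by the default parser

# Element name, attribute value and element content.
hi = root.find(".//hi")
print(hi.tag, hi.get("rend"), hi.text)  # hi super st

# Follow the pointer "#s1r" to the element whose xml:id is "s1r".
by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
pb = root.find("pb")
target = by_id[pb.get("facs").lstrip("#")]
print(target.tag, target.get("n"))  # surface 1r


def text_around_lb(fragment):
    """Return the text directly before and after the first <lb/>."""
    p = ET.fromstring(fragment)
    lb = p.find("lb")
    return p.text, lb.tail


# Whitespace next to the tag is part of the text, so these two differ.
print(text_around_lb("<p>normal<lb/>ly</p>"))    # ('normal', 'ly')
print(text_around_lb("<p>normal <lb/> ly</p>"))  # ('normal ', ' ly')
```

The second pair contains the extra spaces, which is why "normal <lb/> ly" does not come out as a single word.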

XML namespace and schema

Namespaces

Elements in XML documents can be placed in so-called namespaces, which make it possible to distinguish different XML vocabularies. The default namespace for elements in this document is http://www.tei-c.org/ns/1.0 (the TEI namespace). For new elements (normally borrowed from DALF), the namespace is http://mondrian.huygens.knaw.nl/. We refer to this namespace with the prefix 'md', for example <md:addressee>.

Prefix and namespace are defined on the root level element of the XML document, as follows:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:md="http://mondrian.huygens.knaw.nl/">
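To illustrate what the two declarations mean for software that reads the file, here is a minimal sketch in Python (standard library only). The document content and the position of <md:addressee> inside <body> are made up for illustration. Only the two namespace URIs and the 'md' prefix come from this page; the 'tei' prefix is an arbitrary label chosen in the script.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
MD_NS = "http://mondrian.huygens.knaw.nl/"

# Prefixes used in the queries below. They only need to map to the
# right URIs; they do not have to match the prefixes in the document.
NS = {"tei": TEI_NS, "md": MD_NS}

# Hypothetical document using the root element shown above.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:md="http://mondrian.huygens.knaw.nl/">'
    '<text><body>'
    '<md:addressee>Hypothetical addressee</md:addressee>'
    '<p>Hypothetical paragraph</p>'
    '</body></text>'
    '</TEI>'
)

root = ET.fromstring(SAMPLE)

# Unprefixed elements fall into the default (TEI) namespace.
print(root.tag)  # {http://www.tei-c.org/ns/1.0}TEI

# Prefixed elements fall into the namespace bound to that prefix.
addressee = root.find(".//md:addressee", NS)
print(addressee.tag)  # {http://mondrian.huygens.knaw.nl/}addressee

# A TEI element is found only when the query names its namespace.
paragraph = root.find(".//tei:p", NS)
unqualified = root.find(".//p")
print(paragraph is not None, unqualified is None)  # True True
```

The last line shows why the namespace matters in practice: a search for a plain "p" finds nothing, because every unprefixed element in the document belongs to the TEI namespace.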

Schema

[Does this apply to Mondriaan only? Still to be made general] We defined a schema that specifies which elements and attributes are allowed. The schema uses the Relax NG schema language and is named MD.rng. oXygen uses the schema to validate the file and to suggest allowed elements and attributes. The schema is stored in the documentation folder in the repository. Our documents refer to the schema as follows:

<?xml-model href="../documentation/MD.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>

(Read the href attribute as: one folder up (the two dots), then down into the documentation folder)
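The way the relative href is resolved can be sketched in Python (standard library only). The location of the XML file below is a made-up example, as the actual repository layout is not described here. Only the href value is taken from the processing instruction above.

```python
from urllib.parse import urljoin

# Hypothetical location of an XML document in the repository.
document_url = "file:///repo/letters/letter-001.xml"

# The href from the xml-model processing instruction.
schema_href = "../documentation/MD.rng"

# ".." leaves the folder that holds the document ("letters"),
# then the path goes down into "documentation".
schema_url = urljoin(document_url, schema_href)
print(schema_url)  # file:///repo/documentation/MD.rng
```

The href is resolved against the location of the XML file itself, so a file stored at a different depth in the repository would need a different relative path.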

For the writings, we use a separate schema: MDwritings.rng.

See also