Difference between revisions of "About XML"
Latest revision as of 09:33, 12 March 2018
XML at Huygens ING
Huygens ING uses XML to create many of its scholarly editions. XML is a plain text format that uses tags to delimit and name structural elements within the text and to provide additional information about these elements. XML was defined by the W3C. For scholarly editing (and many other purposes), Guidelines on how to use XML were developed by the Text Encoding Initiative (TEI). Huygens ING follows the TEI Guidelines whenever possible.
This wiki documents the choices that we make in using the TEI Guidelines, which occasionally describe multiple ways of encoding a single phenomenon. Sometimes, the documents that we edit have special requirements that TEI does not yet support. In the general pages here we describe the project-independent choices that we made; in project-level pages we describe project-specific additions and modifications. Often, these additions will relate to secondary, editorial material rather than to original material, as the secondary material tends to be more project-specific than the primary material.
XML basics
We clearly cannot provide a full introduction to XML here. Online training is available at many locations, e.g. at W3schools. The TEI also maintains a page with information for learning TEI. You can also attend formal training, e.g. at the Oxford summer school in Digital Humanities.
Still, for those getting started with XML without the patience for more formal training, here are a few XML basics:
- <starting tag>...</ending tag> (pay attention to the position of the / ), e.g.
<address>Prins Willem-Alexanderhof 5</address>
- <tag without content/>, e.g.
<lb/> linebreak
- An element can have an attribute. An attribute has a value, written between quotation marks (" "):
<hi rend="super">st</hi>
hi is the element (highlight)
rend is the attribute
super is the value of the attribute.
- @xml:id is an attribute that provides a unique identifier for an element. To point to it use "#".
E.g. a page surface may be defined as:
<surface n="1r" xml:id="s1r">
Later in the document we refer to the xml:id "s1r" using the pointer "#s1r" on the facs attribute of the page element (pb):
<pb f="1r" n="1" facs="#s1r"/>
- In an XML document, physical linebreaks ('enters'), even multiple ones, are usually treated like a single space: they do not produce linebreaks in the published text.
If you want a linebreak in the text, you have to encode one using <lb/>.
The line is too              → The line is too short
short
The line <lb/> is too short  → The line
                               is too short.
(The arrow (→) shows what an encoded text will usually look like when published)
- XML documents are whitespace sensitive: whether or not there is whitespace (spaces, tabs or newlines) next to a tag makes a difference to the published text:
normal<lb/>ly    → normally
normal <lb/> ly  → normal ly
normal<lb/> ly   → normal ly
- A comment (for instance, a note to yourself or to other transcribers) is encoded in this way:
<!-- This is a comment -->
<!-- This is a strange sign that I cannot read, it's better to leave it now and come back to it in a week or two -->
<!-- This encoding is for hyphenation -->
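The basics above can be combined into one small, well-formed document. This is only an illustrative sketch: the element and attribute names are taken from the examples above, but the text content is invented, the pb element is copied as it appears above, and the sketch leaves out parts (such as the TEI header) that a real edition would need.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <!-- empty element with two attributes; xml:id gives it a unique identifier -->
    <surface n="1r" xml:id="s1r"/>
  </facsimile>
  <text>
    <body>
      <!-- the facs attribute points to the surface above using "#" -->
      <pb f="1r" n="1" facs="#s1r"/>
      <!-- start tag, content, end tag; lb encodes a linebreak in the text -->
      <p>The 1<hi rend="super">st</hi> line<lb/>and the second line</p>
    </body>
  </text>
</TEI>
```

Note that the physical linebreaks and indentation between the tags are only there for readability; the only linebreak that belongs to the text is the one encoded with lb.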
XML namespace and schema
Namespaces
Elements in XML documents can be placed in so-called namespaces. This makes it possible to distinguish different XML vocabularies. The default namespace for elements in Huygens ING projects is http://www.tei-c.org/ns/1.0 (the TEI namespace). For new elements, the namespace that we will use is http://xmlschema.huygens.knaw.nl/ns/1.0. We refer to this namespace with the prefix 'hi', for example <hi:addressee>.
(Existing projects still have their own namespaces, e.g. for Mondrian: http://mondrian.huygens.knaw.nl/. These project-specific namespaces will be phased out.)
We define the prefix and the namespaces on the root element of the XML document, as follows:
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:hi="http://xmlschema.huygens.knaw.nl/ns/1.0">
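As a sketch of how the two namespaces work together inside a document: elements without a prefix are in the default (TEI) namespace, while elements with the hi: prefix are in the Huygens namespace. The position of the addressee element and the placeholder text are assumptions made for illustration only.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:hi="http://xmlschema.huygens.knaw.nl/ns/1.0">
  <!-- no prefix: a TEI element named "hi" (highlight) -->
  <hi rend="super">st</hi>
  <!-- prefix hi: an element named "addressee" in the Huygens namespace -->
  <hi:addressee>name of the addressee</hi:addressee>
</TEI>
```

The prefix hi: and the TEI element hi are unrelated: in the first case "hi" is the name of an element, in the second it is only a shorthand for the namespace in which the element addressee is defined.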
Schema
In XML editing, a so-called schema is used to provide the editor with information about suitable elements, attributes and attribute values in the context that he or she is editing. Schemas are indispensable for proper editing, as they help ensure that the XML files conform to the agreed encodings.
There are multiple schema languages. We use the Relax NG language (pronounced 'relaxing') in its XML syntax.
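To give an impression of what Relax NG in its XML syntax looks like, here is a minimal sketch of a schema that allows a single hi element with a rend attribute. It is not taken from any Huygens schema; the value "sub" is an assumption added to show how a list of permitted attribute values is expressed.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         ns="http://www.tei-c.org/ns/1.0">
  <start>
    <ref name="hi"/>
  </start>
  <define name="hi">
    <element name="hi">
      <!-- rend is required and must be one of the listed values -->
      <attribute name="rend">
        <choice>
          <value>super</value>
          <value>sub</value>
        </choice>
      </attribute>
      <!-- the element contains plain text -->
      <text/>
    </element>
  </define>
</grammar>
```

With such a schema, an XML editor can offer "super" and "sub" as suggestions for rend and flag any other value as an error.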
For each of our projects, we define a specific schema (or multiple specific schemas, if we have multiple types of documents). The schema is defined by the project's XML consultant. It is generated from a source file (a so-called ODD file) maintained in the project's repository. The generated schema is stored there as well.
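As a rough sketch of what the core of an ODD file can look like: a schemaSpec selects TEI modules and may add new elements. The identifier "huygens_example", the choice of modules and the definition of addressee are all assumptions made for illustration and do not reproduce an actual Huygens ODD. The sketch also omits the class membership that a real definition would need before the new element could be used anywhere.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="huygens_example" start="TEI">
  <!-- TEI modules whose elements are included in the schema -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <!-- hypothetical new element, placed in the Huygens namespace -->
  <elementSpec ident="addressee" mode="add"
               ns="http://xmlschema.huygens.knaw.nl/ns/1.0">
    <desc>Addressee of a letter (illustrative definition only).</desc>
    <content>
      <textNode/>
    </content>
  </elementSpec>
</schemaSpec>
```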
The schemas will be published on the web, e.g. for Mondrian: http://xmlschema.huygens.knaw.nl/md.rng
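One common way to tell an XML editor which schema to use is the xml-model processing instruction at the top of the document. The sketch below uses the Mondrian schema address given above; whether Huygens projects associate their schemas in this way is an assumption.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://xmlschema.huygens.knaw.nl/md.rng"
            type="application/xml"
            schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- document content -->
</TEI>
```

The schematypens value identifies the schema as Relax NG, so that an editor that supports xml-model can validate the document and suggest elements while you type.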
Chained schemas (in development)
We will be working towards a situation where a single schema defines the Huygens TEI subset-with-additions. Project-specific schemas will be derived from the Huygens schema.
We start with the Mondrian (letter) schema, modified so as to use the correspDesc elements now available in TEI and to include the features needed for the ePistolarium.
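For orientation, this is roughly what the TEI correspDesc element looks like; it is placed in the profileDesc part of the TEI header. The sketch shows only the general TEI structure with placeholder content. It is an assumption that the Huygens schema will use it in this form.

```xml
<correspDesc xmlns="http://www.tei-c.org/ns/1.0">
  <!-- who sent the letter, from where and when -->
  <correspAction type="sent">
    <persName>name of the sender</persName>
    <placeName>place of sending</placeName>
    <date>date of sending</date>
  </correspAction>
  <!-- who received the letter -->
  <correspAction type="received">
    <persName>name of the addressee</persName>
  </correspAction>
</correspDesc>
```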