Candle Data Model Reference

Version	: Candle 0.10
Published date	: Nov 12, 2011

1. Introduction

The Candle data model is closely based on the XQuery and XPath Data Model (XDM) and XML Schema.

Data types in Candle can be divided into 4 major categories:

Atomic data types: including boolean, numeric types, string, binary stream, uri, qname, id, datetime and measure;
Node data types: including markup nodes (element, attribute, text, data, comment), script nodes (routine and type), and document and directory nodes;
Collection data types: including sequence and map;
Empty and error data types;

2. Candle Data Type Hierarchy

Unlike XML, which is only semi-structured when there is no schema, Candle's literal data is always strongly typed. The type information of an item can be derived from its literal syntax. The detailed syntax of Candle literal values is specified in the Candle Markup Reference.

Most of the Candle types are based on XML Schema types. Below is the type hierarchy of the Candle data model:

Figure 1: Candle Type Hierarchy

(Note that some of the data types, including error, duration, type, directory and map, are not supported in the current beta release.)

2.1 Data Type Categories

Candle data types can be divided into 4 major categories:

Atomic data types are built-in primitive data types that cannot be further decomposed into other primitive types; however, atomic values do have multiple facets defined on them, that return specific aspects of the value, like namespace, local name, string value, unit of measure, second of time, year of date, etc.
Node data types are complex data types that are composed of other primitive data types in a hierarchical manner. Candle introduces a few new node types: the scripting types routine and type, so that they can be treated as first-class data as in many functional programming languages; and the directory node type, which extends the path semantics to the file system level.
Collection data types hold a collection of items. An item is either an atomic value or a node. There are two types of collections in Candle:

Sequence is a linear collection of items, which is ordered by default. The items in a sequence can be accessed by numeric index.
Map is a linear collection of key-item pairs, which is also ordered by default. The items in a map can be accessed by either the numeric index or the key. The key of a map item is always an atomic value.

Empty and error data types.
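
The difference between the two collection kinds can be sketched in Python (this is not Candle syntax). The class name `OrderedMap` and its methods are hypothetical, and the zero-based indexing is simply Python's convention, not a statement about Candle's index base.

```python
from collections import OrderedDict

class OrderedMap:
    """Hypothetical stand-in for a Candle map: ordered key-item pairs."""

    def __init__(self, pairs):
        self._items = OrderedDict(pairs)  # preserves insertion order

    def by_key(self, key):
        # Look an item up by its atomic key.
        return self._items[key]

    def by_index(self, index):
        # Look an item up by its position in insertion order
        # (zero-based here, following Python).
        return list(self._items.values())[index]

# A Python list stands in for a sequence: items are reached by index only.
sequence = ["a", "b", "c"]

# A map offers both access paths to the same item.
mapping = OrderedMap([("x", 10), ("y", 20)])
```

Here `mapping.by_key("y")` and `mapping.by_index(1)` reach the same item, whereas the sequence offers only positional access.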

3. Data Type Characteristics and Components

3.1 Data Type Characteristics

Where the data model is concerned, we should differentiate data types by the unique characteristics of their data values, rather than by their syntax. Some of the important characteristics of the data types are: cardinality, node identity, node reference and document order.

Cardinality and Sequences

The cardinality of a data instance is the count of data items in the instance. If the instance is an item, either atomic or a node, then its cardinality is 1. If the instance is a sequence, then the cardinality is the count of the items in the sequence.

A sequence in Candle must have at least 2 items; otherwise, it is an item, not a sequence. Sequences never contain other sequences; if sequences are combined, the result is always a flattened sequence. In other words, appending (d, e) to (a, b, c) produces a sequence of cardinality 5: (a b c d e). It does not produce a sequence of cardinality 4: (a b c (d e)); such a nested sequence never occurs.
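
The flattening rule can be sketched in Python as follows. This is an illustration under assumptions, not Candle's implementation: Python lists stand in for sequences, any other value stands in for an item, and the helper names `combine` and `cardinality` are hypothetical.

```python
def combine(*values):
    """Combine items and sequences into one flat result."""
    flat = []
    for value in values:
        if isinstance(value, list):
            flat.extend(value)   # a sequence is spliced in, never nested
        else:
            flat.append(value)   # a single item is appended as-is
    if len(flat) == 1:
        return flat[0]           # one item is an item, not a sequence
    return flat

def cardinality(value):
    # An item has cardinality 1; a sequence counts its items.
    return len(value) if isinstance(value, list) else 1

combined = combine(["a", "b", "c"], ["d", "e"])
```

With these definitions, `cardinality(combined)` is 5, and combining a single item with nothing else yields that item rather than a one-item sequence.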

Node Identity and Node Reference

The most important characteristic that tells atomic types apart from node types is node identity.

Atomic values do not have identity. Every instance of the value 5 as an integer is identical to every other instance of the value 5 as an integer.

Each node has a unique identity, whether it is constructed dynamically or loaded from some data source. Every node in an instance of the data model is unique: identical to itself, and not identical to any other node.

A node reference is a value used to refer to a node. In Candle, a node reference is always expressed as a URI, following schemes like IDREF or XPointer. Different node references may refer to the same node.

Only nodes loaded from data sources can be addressed through node reference. Atomic values and nodes constructed dynamically cannot be addressed through any node reference.
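
Python's own distinction between equality and object identity gives a rough analogy for the difference between atomic values and nodes. The `Node` class and the `same_node` helper below are hypothetical and only illustrate the idea.

```python
class Node:
    """Hypothetical node: distinguished by identity, not by content."""

    def __init__(self, name):
        self.name = name

def same_node(a, b):
    # Node identity: true only when both names refer to the very same node.
    return a is b

n1 = Node("p")
n2 = Node("p")   # same content, but a distinct node
alias = n1       # a second reference to the same node

# Atomic values carry no identity: they are compared purely by value.
atomic_equal = (int("5") == 5)
```

Here `n1` and `n2` are different nodes despite having equal content, while `alias` and `n1` denote the same node, much as two different node references may refer to one node.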

Document Order

[Definition: A document order is defined among all the nodes accessible during a given query or transformation. Document order is a total ordering, although the relative order of some nodes is implementation-dependent. Informally, document order is the order in which nodes appear in the serialization of a document.] [Definition: Document order is stable, which means that the relative order of two nodes will not change during the processing of a given query or transformation, even if this order is implementation-dependent.]

Within a tree, document order satisfies the following constraints:

The root node is the first node.
Every node occurs before all of its children and descendants.
Attribute nodes immediately follow the element node with which they are associated. The relative order of attribute nodes is stable but implementation-dependent.
The relative order of siblings is the order in which they occur in the children property of their parent node.
Children and descendants occur before following siblings.

The relative order of nodes in distinct trees is stable but implementation-dependent, subject to the following constraint: If any node in a given tree, T1, occurs before any node in a different tree, T2, then all nodes in T1 are before all nodes in T2.
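
Within a single tree, the constraints above amount to a pre-order traversal in which an element's attributes are visited right after the element itself. A minimal Python sketch, assuming a hypothetical `TreeNode` class and taking attribute order to be list order (the model leaves it implementation-dependent):

```python
class TreeNode:
    """Hypothetical tree node with attribute names and child nodes."""

    def __init__(self, name, attributes=(), children=()):
        self.name = name
        self.attributes = list(attributes)
        self.children = list(children)

def document_order(node):
    """Yield labels in document order: node, its attributes, then each subtree."""
    yield node.name
    for attr in node.attributes:
        yield "@" + attr                     # attributes follow their element
    for child in node.children:
        yield from document_order(child)     # descendants precede later siblings

tree = TreeNode("root", ["id"], [
    TreeNode("a", ["x"], [TreeNode("a1")]),
    TreeNode("b"),
])

order = list(document_order(tree))
# order == ["root", "@id", "a", "@x", "a1", "b"]
```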

3.2 Data Type Components and Accessors

Besides the important characteristics of the data types, data types also have components. Data components are values stored in a data type. Accessors are defined to retrieve various components stored in a data type.

Even an atomic type can have multiple data components. When we say a type is atomic, we actually mean that it is non-hierarchical, or flat, in contrast to a node type; being atomic does not mean that there is only one value in it. For example, the datetime instance 't:2003-01-02T11:30:00-05:00' has several value components: (2003, 1, 2, 11, 30, 0.0, -PT05H00M).
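
The same decomposition can be observed with Python's standard `datetime` type, used here purely as an analogy. The sketch parses the ISO 8601 text of the example (taking the year to be 2003, as in the component list, and omitting the Candle literal prefix) and reads the components back.

```python
from datetime import datetime, timedelta

# ISO 8601 text of the example value; the Candle literal prefix is omitted.
value = datetime.fromisoformat("2003-01-02T11:30:00-05:00")

# One flat value, several components.
components = (
    value.year, value.month, value.day,
    value.hour, value.minute, value.second,
)

# The timezone offset is a component too; express it in hours.
offset_hours = value.utcoffset() / timedelta(hours=1)
```

The value is a single flat item, yet it exposes year, month, day, hour, minute, second and timezone offset as separate components.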

The accessors defined in Candle are:

Accessor	Prototype	Semantic
string-value accessor	`?string as string`	- returns the string value of the context item;
typed-value accessor	`?value as atomic*`	- returns the atomized value(s) of the context item;
qualified-name accessor	`?qname as qname`	- returns the qualified name of the context item;
namespace accessor	`?namespace as uri`	- returns the namespace URI of the context item as `uri`;
kind accessor	`?kind as qname`	- returns the built-in type name of the context item;
type-name accessor	`?type as qname`	- returns the type name of the context item;
root node accessor	`?root as node`	- returns the root node containing the context item;
parent node accessor	`?parent as node`	- returns the parent node containing the context item;
previous node accessor	`?previous as node`	- returns the previous node of the context item;
next node accessor	`?next as node`	- returns the next node of the context item;
attributes accessor	`?attributes as node*`	- returns the attribute nodes of the context item;
children accessor	`?children as node*`	- returns the child nodes of the context item;
URI accessor	`?uri as uri`	- returns the URI that can be used to address the context item;
number accessor	`?number as number`	- returns the numeric value if the type is a numeric type or a measure type;
unit accessor	`?unit as qname`	- returns the unit of a measure as `qname`;
datetime components accessor	`?year, ?month, ?day, ?week-day, ?hour, ?minute, ?second, ?millisecond as integer`	- returns the corresponding component of the `datetime` type as `int`;
color components accessor	`?r, ?g, ?b as integer`	- returns the r, g, or b component of a `color` as `int`;
source accessor	`?source as string`	- returns the serialized source text of the context item;

(Some accessors, including ?root, ?parent, ?attributes, ?children and ?uri, are not implemented in the current beta release.)
For consistency, the accessors are defined on all node types, although they may return the empty value on certain data types.
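
The idea that an accessor is always defined but may return empty can be sketched in Python. Everything here is hypothetical: the `Measure` and `Element` classes, the `access` helper, the use of `None` for the empty value, and the representation of a unit as a plain string rather than a qname.

```python
from dataclasses import dataclass

EMPTY = None  # stand-in for Candle's empty value

@dataclass
class Measure:
    number: float
    unit: str

@dataclass
class Element:
    qname: str
    children: tuple = ()

def access(item, accessor):
    """Return the named component, or EMPTY when the item has none."""
    return getattr(item, accessor, EMPTY)

width = Measure(12.0, "px")
para = Element("p")
```

Asking `width` for its `unit` yields `"px"`, while asking `para` for a `unit` yields the empty stand-in instead of raising an error.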

4. Data Model Serialization

Documents in various formats (XML, XHTML, HTML, JSON, MIME, CSV, plain text, etc.) are mapped into and processed under the unified data model during script evaluation. Candle's data model can be seen as the superset of all these source formats. The input format differences only matter during serialization and deserialization.

An input document is serialized back into its original format by default. Candle ensures that no data in the data model is lost during serialization and deserialization. However, non-significant syntactic information (such as whitespace or the case of HTML element names) might be normalized during the process.

Candle supports transformation from one document format to another, but there might be data loss in such a transformation, due to the limitations of the target format. For example, namespace information is lost when you serialize into HTML format.

Some of the Candle data types cannot be serialized into a markup document:

Script data types: in Candle, functions and patterns are first-class data as in functional programming languages. However, code data types cannot be serialized into Candle markup documents. For an object, if it contains code data, only the non-code data can be serialized.
Error data type: error values can never be serialized into a markup document. If a dynamically constructed document contains errors, the errors are ignored and not serialized. This restriction exists simply because there is no literal representation of the error data type in Candle. Error is a data type used only at runtime.
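
The filtering described above can be sketched in Python. This only illustrates the rule and is not how Candle's serializer is written: Python callables stand in for script values, and `ErrorValue`, `serializable` and `serialize_fields` are hypothetical names.

```python
class ErrorValue:
    """Hypothetical runtime-only error value."""

def serializable(value):
    # Script values (modelled as callables) and errors have no literal form.
    return not callable(value) and not isinstance(value, ErrorValue)

def serialize_fields(obj):
    """Keep only the fields that could be written to a markup document."""
    return {key: value for key, value in obj.items() if serializable(value)}

record = {
    "title": "Report",          # plain data: kept
    "render": lambda: None,     # script value: skipped
    "failure": ErrorValue(),    # error value: skipped
}

kept = serialize_fields(record)
```

Only the `title` field survives; the script value and the error value are dropped without failing the whole serialization.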

Appendices

A. Candle Data Types vs. XQuery Data Types

While Candle data types are closely based on XQuery and XML Schema data types, there are substantial differences between them. The comparison below shows where Candle's data model is cleaner and more general. For your convenience, the diagram of the XQuery Type Hierarchy is included here:

Figure 2: XQuery Type Hierarchy

Types that have direct correspondence between Candle and XQuery are:

boolean: same as xs:boolean;
numeric types:

byte: same as xs:byte;
ubyte: same as xs:unsignedByte;
short: same as xs:short;
ushort: same as xs:unsignedShort;
int: same as xs:int;
uint: same as xs:unsignedInt;
long: same as xs:long;
ulong: same as xs:unsignedLong;
integer: same as xs:integer;
decimal: same as xs:decimal;
float: same as xs:float;
double: same as xs:double;

string: same as xs:string;
binary: combines xs:base64Binary and xs:hexBinary;
uri: same as xs:anyURI;
qname: same as xs:QName;
id: same as xs:ID;
datetime types:

date-time: same as xs:dateTime;
date: same as xs:date;
time: same as xs:time;
year-month: same as xs:gYearMonth;
year: same as xs:gYear;
month-day: same as xs:gMonthDay;
month: same as xs:gMonth;
day: same as xs:gDay;

Although these types share the same data model, their literal syntax in Candle and XQuery can differ. Refer to the Candle Markup Reference for the detailed syntax.

There are a few new types introduced by Candle that do not have corresponding XQuery and XML Schema data types:

measure types:

they are introduced primarily due to their wide usage in presentation markup languages, like HTML, CSS and SVG;
although only predefined units can be used in the current implementation, user-defined units may be allowed in the future, so that measurements can be used in a general manner;

map: is commonly used in many scripting languages;
directory: so as to extend the path semantics to the file system level;
script types: for functions and types to be treated as first-class data as in many functional languages;

Some data types defined in XQuery and XML Schema are excluded by Candle:

xs:untyped: as Candle is always strongly typed, there's no need for an untyped data type in Candle;
some derived types: like xs:nonNegativeInteger, xs:positiveInteger, xs:nonPositiveInteger, xs:negativeInteger, are not defined in Candle, as they can be easily implemented as user-defined types in Candle;
xs:IDREF: is unified under uri type in Candle;
namespace nodes are treated as pure lexical information instead of nodes in Candle data model;
processing-instruction: is treated as comment in Candle;
DTD-related types: like xs:language, xs:ENTITY, xs:NOTATION, xs:NMTOKEN are not supported by Candle;