DataSet XML Format

From Eigenvector Documentation Wiki

This page describes the XML format used to construct a DataSet object. The DataSet object is a container for scientific data that stores numerical values (known as the data) along with the typical associated contextual information.

For the purposes of this page, it is important to note that the DataSet object allows one or more sets of textual labels to be associated with each column and/or row of a data matrix. In addition, numerical "axis scale" values can be associated with each column or row.

The object has considerably more flexibility than this document discusses. For additional information on the fields and structure of the DataSet object, see the object's documentation.

By convention in Solo and PLS_Toolbox, each row of a data table is considered a "sample" (or "observation") and the columns of a data table are the variables measured on each sample. Thus, a typical DataSet object (DSO) used to make a prediction is created around a single row of values. An XML construct of a DSO will, therefore, always contain at least a <data> tag to describe the data.

Numerical values for an XML construct of a DSO are given in comma-separated and semicolon-separated format. Commas separate values on the same row of a matrix (item,item,item); semicolons indicate row-wise breaks (row; row; row). White space is always ignored.
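As an illustration of the comma and semicolon convention, the following Python sketch converts between a list of rows and the text placed inside the data tag. The function names format_matrix and parse_matrix are hypothetical helpers written for this page, not part of Solo or PLS_Toolbox, and the tolerance for a trailing semicolon is an assumption.

```python
def format_matrix(rows):
    """Format equal-length numeric rows as 'a,b,c; d,e,f'."""
    if not rows or not rows[0]:
        raise ValueError("data must contain at least one value")
    if any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("all rows must contain the same number of elements")
    # Commas separate values within a row; semicolons separate rows.
    return "; ".join(",".join(str(v) for v in row) for row in rows)


def parse_matrix(text):
    """Parse 'a,b,c; d,e,f' into a list of rows of floats."""
    # White space is ignored by the format, so remove all of it first.
    compact = "".join(text.split())
    # Assumption: empty pieces (e.g. after a trailing semicolon) are skipped.
    return [[float(v) for v in row.split(",")]
            for row in compact.split(";") if row]
```

For example, format_matrix([[1, 2, 3], [4, 5, 6]]) produces the text "1,2,3; 4,5,6", and parse_matrix reads that text back into two rows of three values.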

Basic DSO XML

The basic XML DSO construct consists of the outer <obj> tag with a "class" attribute indicating that the object is a DSO. There are actually two formats for creating DataSet objects. One uses class="dataset" and is a complete but complicated description of a DataSet object. The other uses class="dso" and is much easier to create. We recommend class="dso" for most applications and will not discuss class="dataset" in this document.

The outer tag must contain a <data> tag, which will always have the class="numeric" attribute (because data will always be numeric).

<obj class="dso">
  <data class="numeric"> 1,2,3,4,5 </data>
</obj>

This XML construct would create a simple DSO containing the values 1 to 5 in a row vector. If the DSO being created should contain multiple rows, a semicolon should be used after each row of numbers. Note that all rows must contain the same number of elements. The following would create a data matrix with 3 rows and 5 columns:

<obj class="dso">
  <data class="numeric"> 11,12,13,14,15; 21,22,23,24,25; 31,32,33,34,35 </data>
</obj>
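When the XML is generated by a program rather than typed by hand, the same structure can be produced with a general-purpose XML library. The sketch below uses Python's standard xml.etree.ElementTree module; build_basic_dso is a hypothetical helper name, and the code only mirrors the tags shown above rather than any official Eigenvector tool.

```python
import xml.etree.ElementTree as ET


def build_basic_dso(rows):
    """Return XML text for a class="dso" object wrapping a numeric matrix."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must contain the same number of elements")
    obj = ET.Element("obj", {"class": "dso"})
    data = ET.SubElement(obj, "data", {"class": "numeric"})
    # Commas separate values within a row; semicolons separate rows.
    data.text = "; ".join(",".join(str(v) for v in row) for row in rows)
    return ET.tostring(obj, encoding="unicode")


xml_text = build_basic_dso([[11, 12, 13, 14, 15],
                            [21, 22, 23, 24, 25],
                            [31, 32, 33, 34, 35]])
print(xml_text)
```

The printed text is the 3-row, 5-column example above on a single line. Because white space is ignored in the numeric content, the missing indentation should not matter.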

DSO with Labels

In most cases, it is desirable to associate some contextual information with the variables which are being passed to the predictor. This is often expressed as either textual labels indicating the measured parameter (often giving the name or purpose of the device: "thermocouple A") or numeric axis scale values (often used in spectroscopy, electrochemistry, time-based measurements, etc.). These contextual data will be used by Solo_Predictor to help align new data to a model, verify that the new data has all the expected variables, and replace those which are missing.

To include labels in a DSO, an additional <label> tag must be added to the XML description. The label tag can contain one or more label "sets", each enclosed in a <set> tag. Each set contains three elements: mode, name, and content. The mode tag indicates the data mode (1 = rows/samples, 2 = columns/variables) for which the label set is being defined. The name tag is optional but, if present, gives a name to associate with this label set. The content tag defines the actual labels for each element on the given mode. Each label must be enclosed in its own separate <sr> tag, and there must be exactly one <sr> tag for each column or row (whichever mode the labels are being associated with). For example, the following creates the labels "A" through "E" for the five columns of our example data:

<obj class="dso">
  <data class="numeric"> 1,2,3,4,5 </data>
  <label>
    <set>
      <mode>2</mode>
      <name class="string">example variable labels</name>
      <content class="string">
        <sr>A</sr>
        <sr>B</sr>
        <sr>C</sr>
        <sr>D</sr>
        <sr>E</sr>
      </content>
    </set>
  </label>
</obj>

A label can be added for the sample (first mode) by including an additional <set> tag inside the <label> tag (before or after the <set> tags already included above):

  <set>
    <mode>1</mode>
    <name>example sample label</name>
    <content class="string">
      <sr>This is my one sample</sr>
    </content>
  </set>

Although the <content> tag uses the <sr> tags to enclose the string, this is not necessary in this case. Any time a single string value is being created, the <sr> tags can be omitted as can the class attribute. Thus the content tag could have read:

   <content>This is my one sample</content>
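The label structure described in this section can also be assembled programmatically. The following Python sketch, using the standard xml.etree.ElementTree module, adds label sets to an existing <obj> element; add_label_set is a hypothetical helper name, and the sketch always writes the <sr> tags and the class attribute, even where the short form is allowed.

```python
import xml.etree.ElementTree as ET


def add_label_set(obj, mode, labels, name=None):
    """Append one label <set> for the given mode to a DSO <obj> element."""
    # Reuse an existing <label> tag if present; otherwise create it.
    label = obj.find("label")
    if label is None:
        label = ET.SubElement(obj, "label")
    label_set = ET.SubElement(label, "set")
    ET.SubElement(label_set, "mode").text = str(mode)
    if name is not None:  # the name tag is optional
        ET.SubElement(label_set, "name", {"class": "string"}).text = name
    content = ET.SubElement(label_set, "content", {"class": "string"})
    for text in labels:  # one <sr> tag per row or column
        ET.SubElement(content, "sr").text = text
    return label_set


obj = ET.Element("obj", {"class": "dso"})
ET.SubElement(obj, "data", {"class": "numeric"}).text = "1,2,3,4,5"
add_label_set(obj, 2, ["A", "B", "C", "D", "E"], name="example variable labels")
add_label_set(obj, 1, ["This is my one sample"], name="example sample label")
print(ET.tostring(obj, encoding="unicode"))
```

Both sets end up inside a single <label> tag, matching the layout described above.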

DSO with Axis Scale

Numeric axis scale values can be added using an axisscale tag (note the tag name does not contain a space) with similar content to the label tag. The only difference is that the axisscale property expects numeric values, so the <content> tag is defined with the class="numeric" attribute and the values are supplied as a comma-separated list. The following defines an axisscale for the variables running from 500 to 508 in steps of 2:

<obj class="dso">
  <data class="numeric"> 1,2,3,4,5 </data>
  <axisscale>
    <set>
      <mode>2</mode>
      <name>example axis scale</name>
      <content class="numeric"> 
         500,502,504,506,508
      </content>
    </set>
  </axisscale>
</obj>

As with labels, note that the number of items defined in the content must match the length (number of elements) of the given mode (columns in this example).
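The length requirement can be checked before the XML is sent anywhere. The following Python sketch adds an axisscale set to an <obj> element and raises an error if the number of values does not match the data; add_axisscale is a hypothetical helper name, and the sketch assumes the <data> tag has already been added.

```python
import xml.etree.ElementTree as ET


def add_axisscale(obj, mode, values, name=None):
    """Append a numeric axisscale <set>, checking its length against the data."""
    compact = "".join(obj.find("data").text.split())  # white space is ignored
    rows = [row for row in compact.split(";") if row]
    # Mode 1 is rows/samples, mode 2 is columns/variables.
    expected = len(rows) if mode == 1 else len(rows[0].split(","))
    if len(values) != expected:
        raise ValueError(
            f"mode {mode} has {expected} elements, got {len(values)} values")
    axisscale = obj.find("axisscale")
    if axisscale is None:
        axisscale = ET.SubElement(obj, "axisscale")
    scale_set = ET.SubElement(axisscale, "set")
    ET.SubElement(scale_set, "mode").text = str(mode)
    if name is not None:
        ET.SubElement(scale_set, "name").text = name
    content = ET.SubElement(scale_set, "content", {"class": "numeric"})
    content.text = ",".join(str(v) for v in values)
    return scale_set


obj = ET.Element("obj", {"class": "dso"})
ET.SubElement(obj, "data", {"class": "numeric"}).text = "1,2,3,4,5"
# 500 to 508 in steps of 2 gives five values, one per column.
add_axisscale(obj, 2, list(range(500, 509, 2)), name="example axis scale")
print(ET.tostring(obj, encoding="unicode"))
```

Calling add_axisscale(obj, 2, [500, 502, 504]) on the same object would raise a ValueError, because the data has five columns.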

Other DSO Properties

Most of the remaining DSO properties (fields) can be set using similar constructs. For example, classes and titles (see the DataSet object documentation for more information on these fields) can be added to the DSO using tags similar to label and axisscale. Titles must have content of class="string" and must contain a single string. Classes can have numeric or string content and must have sufficient elements to match the size of the given mode.

In addition, the include field uses the <set> notation described above, and the author, name, description, and userdata fields all use the single-tag notation (as with the data tag: the tag is named for the field, carries the class attribute, and encloses the content). For example, see below:

<obj class="dso">
  <data class="numeric"> 1,2,3,4,5 </data>
  <name class="string">Name for Dataset</name>
  <author class="string">Dataset\'s Author</author>
  <description class="string">
    <sr>Include a multi-line string here</sr>
    <sr>Use as many sr tags as you have lines</sr>
  </description>
</obj>

Note the use of the backslash in front of the single quote included in the author tag. This is only necessary when passing XML through the Solo_Predictor interface. When XML is saved to a file, backslashes are not needed.

The following example combines the elements described on this page into a single DSO:

<obj class="dso">
  <data class="numeric">1,2,3,4,5</data>
  <name>Name for Dataset</name>
  <author>Dataset\'s Author</author>
  <description class="string">
    <sr>Include a multi-line string here</sr>
    <sr>Use as many sr tags as you have lines</sr>
  </description>
  <axisscale>
    <set>
      <mode>2</mode>
      <name>example axis scale</name>
      <content class="numeric"> 
         500,502,504,506,508</content>
    </set>
  </axisscale>
  <label>
    <set>
      <mode>2</mode>
      <name>example variable labels</name>
      <content class="string">
        <sr>A</sr>
        <sr>B</sr>
        <sr>C</sr> 
        <sr>D</sr> 
        <sr>E</sr> 
      </content>
    </set>
    <set>
      <mode>1</mode>
      <name>example sample label</name>
      <content>This is my one sample</content>
    </set>
  </label>
</obj>
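Before passing a description like the one above to a predictor, it can be useful to confirm that every label and axisscale set has the right number of items for its mode. The following Python sketch performs that check with the standard xml.etree.ElementTree module. The name check_dso is hypothetical, the sketch handles only modes 1 and 2 and only the label and axisscale fields, and it is not a substitute for the validation performed by Solo_Predictor itself.

```python
import xml.etree.ElementTree as ET


def check_dso(xml_text):
    """Return a list of problems found in a class="dso" XML description."""
    obj = ET.fromstring(xml_text)
    data = obj.find("data")
    text = "".join((data.text or "").split()) if data is not None else ""
    rows = [row.split(",") for row in text.split(";") if row]
    if not rows:
        return ["missing or empty <data> tag"]
    problems = []
    if any(len(row) != len(rows[0]) for row in rows):
        problems.append("rows of <data> differ in length")
    size = {1: len(rows), 2: len(rows[0])}  # mode -> number of elements
    for field in ("label", "axisscale"):
        for item in obj.findall(f"{field}/set"):
            mode = int((item.findtext("mode") or "0").strip())
            content = item.find("content")
            if mode not in size or content is None:
                problems.append(
                    f"{field} set needs a mode of 1 or 2 and a <content> tag")
                continue
            if field == "axisscale":
                values = "".join((content.text or "").split()).split(",")
                count = len([v for v in values if v])
            else:
                # A single string may omit its <sr> tag; count it as one label.
                count = len(content.findall("sr")) or 1
            if count != size[mode]:
                problems.append(
                    f"{field} mode {mode}: {count} items for {size[mode]} elements")
    return problems
```

An empty list means no mismatch was found. A description with three data columns but only two variable labels would instead yield one message naming the label set and its mode.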

Please contact Eigenvector Research for more information on the DSO XML format, if needed.