DataSet Construction: Difference between revisions

From Eigenvector Research Documentation Wiki
imported>Mathias


==From the MATLAB Command Line==
From the command line, the easiest way to create a dataset is to pass an array to the dataset function. First we create an array of data, then pass it to the dataset function.
<pre>
>> t    = [0:0.1:10]';
>> x    = [cos(t) sin(t) exp(-t)];
>> data = dataset(x)

data =
       name: x
       type: data
       date: 23-May-2016 11:24:53
    moddate: 23-May-2016 11:24:53
       data: 101x3 [double]
      label: {2x1} [array (char)]
              Mode 1  [: ]
              Mode 2  [: ]
  axisscale: {2x1} [vector (real)] (axistype)
              Mode 1  [: ] (none)
              Mode 2  [: ] (none)
      title: {2x1} [vector (char)]
              Mode 1  [: ]
              Mode 2  [: ]
      class: {2x1} [vector (double)]
              Mode 1  [: ]
              Mode 2  [: ]
    classid: {2x1} [cell of strings]
    include: {2x1} [vector (integer)]
              Mode 1  [: 1x101]
              Mode 2  [: 1x3]
    history: {1x1 cell} [array (char)]
      OTHER: [View Class Summary]
</pre>


Alternatively, we could start with an empty dataset and assign the array x to its data field.
<pre>
newdata = dataset;
newdata.data = x;
</pre>


Similarly, we may set the other fields of the dataset object individually.
<pre>
vars = {'cos(t)';'sin(t)';'exp(-t)'};
newdata.author           = 'Data Manager';  %sets the author field
newdata.label{2}         = vars;            %sets the labels for columns = dimension 2
newdata.labelname{2}     = 'Variables';     %sets the name of the label for columns
newdata.axisscale{1}     = t;               %sets the axis scale for rows = dimension 1
newdata.axisscalename{1} = 'Time';          %sets the name of the axis scale for rows
newdata.title{1}         = 'Time (s)';      %sets the title for rows
newdata.titlename{1}     = 'Time Axis';     %sets the titlename for rows
newdata.title{2}         = 'f(t)';          %sets the title for columns
newdata.titlename{2}     = 'Functions';     %sets the titlename for columns
</pre>


Revision as of 12:30, 23 May 2016

Getting Started

In general, data is stored in a dataset object.



From a GUI

Using PLS_Toolbox and Solo, it is easy to import data into a dataset object using the data importer. From the workspace browser select File/Import Data to launch the GUI.


Alternatively, this can be achieved by dragging the desired file into the Workspace Browser. In the case of text-based file formats such as CSV, this will launch the following window.

Text Import.png

This window allows the user to choose options specific to this file, such as the number of header rows to ignore and which delimiter to use. Clicking OK will launch the data import tool pictured below. The user can specify which columns and rows will be used as the dataset's axis scales and labels. In this example, the first row and the second column have been specified as axis scales.

Import.png



Datasets can be created by passing an array to the dataset function. In this example we will use the data field from the wine demo dataset.


>> load wine
>> dat   = wine.data;
>> names = wine.label{1};
>> vars  = wine.label{2};
>> whos
  Name        Size            Bytes  Class      Attributes

  dat        10x5               400  double              
  names      10x6               120  char                
  vars        5x6                60  char                
  wine       10x5             12156  dataset  

The variable 'dat' contains the data array for the 5 variables (wine, beer, and liquor consumption, life expectancy, and heart disease) measured on 10 samples (countries).

The country names are contained in the variable 'names' and the variable names are contained in 'vars'. The next step creates a DataSet object and assigns its name, author, description, and labels.

»wined = dataset(dat);
»wined.name = 'Wine';
»wined.author = 'A.E. Newman';
»wined.description= ...
{'Wine, beer, and liquor consumption (gal/yr)',...
'life expectancy (years), and heart disease rate', ...
'(cases/100,000/yr) for 10 countries.'};
»wined.label{1} = names;
»wined.label{2} = vars;
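
To check the assignments, the fields can be read back with the same dot notation used to set them. The following is a minimal sketch, assuming PLS_Toolbox is on the MATLAB path and the wine demo data loads as shown above; it rebuilds a small version of wined so that it can be run on its own.

```matlab
% Sketch only: requires PLS_Toolbox (dataset class and the wine demo data).
load wine                            % demo DataSet used in this section
wined          = dataset(wine.data); % build a new DataSet from the raw array
wined.name     = 'Wine';
wined.author   = 'A.E. Newman';
wined.label{1} = wine.label{1};      % country names (rows)
wined.label{2} = wine.label{2};      % variable names (columns)

% Read the fields back with the same dot notation used to set them.
disp(wined.name)
disp(wined.author)
disp(size(wined.data))
disp(wined.label{1})
```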

Additional assignments can also be made. Here the labels for the first mode (rows) are assigned explicitly; they are shown next to the data array (like sample labels). A title, classes, and an axis scale are also assigned for this mode.

»wined.labelname{1} = 'Countries';
»wined.label{1} = ...
{'France', ...
'Italy', ...
'Switz', ...
'Austra', ...
...
'Mexico'};
»wined.title{1} = 'Country';
»wined.class{1} = [1 1 1 2 3];
»wined.classname{1} = 'Continent';
»wined.axisscale{1} = 1:5;
»wined.axisscalename{1} = 'Country Number';

Additional assignments can also be made for mode 2. Here the labels for the second mode (columns) are assigned explicitly; they are shown above the data array (like column headings). Titles, classes, and axis scales can be assigned in the same way.

»wined.labelname{2} = 'Variables';
»wined.label{2} = ...
{'Liquor','Wine','Beer','LifeExp','HeartD'};

If the data matrix is N-way, the assignment process can be extended to Mode 3, Mode 4, ... Mode N. It can also be extended to multiple sets of labels and axis scales, e.g.:

»wined.labelname{2,2} = 'Alcohol Content and Quality';
»wined.label{2,2} = {'high','medium','low','good','bad'};

An individual label can be replaced by further indexing into a given label set using curly braces followed by the string replacement:

»wined.label{2,2}{4} = 'excellent';

Indexing Into DataSets

Sub-portions of the DataSet can be retrieved by indexing into the main DataSet object. For example, here the first three columns ('Liquor', 'Wine', and 'Beer') are extracted into a new DataSet named "alcohol":

»alcohol = wined(:,1:3);

Additionally, any field in the DataSet can also be indexed into directly. Here the second country name is pulled out of the labels by extracting the entire second row of the mode 1 labels:

»country2 = wined.label{1}(2,:);
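
Because the mode 1 labels are held as a padded character array (the whos listing above shows the country names as 10x6 char), a row extracted this way may carry trailing blanks. The sketch below assumes PLS_Toolbox and the wine demo data are available; strtrim is standard MATLAB, and indexing rows with wined(1:3,:) is assumed to work by analogy with the column example.

```matlab
% Sketch only: requires PLS_Toolbox and the wine demo data.
load wine
wined = dataset(wine.data);
wined.label{1} = wine.label{1};           % country names, stored as a char array

alcohol  = wined(:,1:3);                  % first three columns, as in the text
first3   = wined(1:3,:);                  % assumed analogous: first three rows
country2 = strtrim(wined.label{1}(2,:));  % second country name without padding

disp(size(alcohol.data))
disp(size(first3.data))
disp(country2)
```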

Indexing using Labels and Classes

A shortcut to extract a subset of a DataSet is to index into the main DataSet object using labels and/or classes for the requested item(s). For example, to extract a DataSet containing only the Liquor values, you could use:

»alcohol = wined.liquor;

Note that the case of the characters in the label does not matter. If the label or class starts with a number or contains any non-alphanumeric characters, you must enclose the label in parentheses and quotes:

»alcohol = wined.('liquor');
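
The parenthesized form is MATLAB's general dynamic field name syntax, so the label can presumably also be supplied from a variable. The sketch below assumes the label-based indexing described above behaves the same for a string held in a variable; the name wanted is only illustrative.

```matlab
% Sketch only: requires PLS_Toolbox and the wine demo data.
load wine
wined = dataset(wine.data);
wined.label{2} = {'Liquor','Wine','Beer','LifeExp','HeartD'};

wanted = {'liquor','wine','beer'};    % illustrative list of column labels
for k = 1:numel(wanted)
    sub = wined.(wanted{k});          % same form as wined.('liquor')
    fprintf('%s: %d x %d\n', wanted{k}, size(sub.data,1), size(sub.data,2));
end
```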


Indexing with Class or Label Set Names

Class and label information can be extracted using the "set name" and dot notation:

mylabels = wine.Country
mylabels =
France
Italy
Switz
Austra
Brit
U.S.A.
Russia
Czech
Japan
Mexico

Note that class names will be checked first, before label names.

Creating 3-Way Data

There are several ways to create 3-way data in PLS_Toolbox.

If the data is given in separate text-based files such as .csv, the data can easily be imported into a 3-way dataset using the Text import data tool. By dragging all files into the Workspace Browser and then selecting


3-way.png
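
For data already in the MATLAB workspace, a 3-way array can also be assembled with standard MATLAB and passed to the dataset function. This is only a sketch: it assumes dataset accepts a 3-way array in the same way it accepts a matrix, and the slab variables are made-up placeholders filled with random numbers.

```matlab
% Sketch only: requires PLS_Toolbox for the dataset call.
slab1 = rand(4,3);                % placeholder data: 4 samples x 3 variables
slab2 = rand(4,3);                % second slab of the same size
cube  = cat(3, slab1, slab2);     % stack along mode 3, giving a 4 x 3 x 2 array

threeway = dataset(cube);         % assumed to work as for a 2-way array
threeway.label{3} = {'Slab 1','Slab 2'};   % labels for mode 3
disp(size(threeway.data))
```

Each slab must have the same number of rows and columns for cat(3, ...) to succeed.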