DataSet Construction

From Eigenvector Research Documentation Wiki
Revision as of 12:30, 23 May 2016 by imported>Mathias (→‎From a GUI)

Getting Started

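In PLS_Toolbox and Solo, data is generally stored in a DataSet object, which keeps the data array together with its labels, axis scales, classes, and descriptive information.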



From a GUI

Using PLS_Toolbox and Solo, it is easy to import data into a dataset object using the data importer. From the workspace browser select File/Import Data to launch the GUI.


Alternatively, this can be achieved by dragging the desired file into the Workspace Browser. For text-based file formats such as CSV, this will launch the following window.

Text Import.png

This window allows the user to choose options specific to this file, such as the number of header rows to ignore and which delimiter to use. Clicking OK will launch the data import tool pictured below. The user can specify which columns and rows will be used as the dataset's axis scales and labels. In this example, the first row and the second column have been specified as axis scales.

Import.png




From the MATLAB Command Line

Datasets can be created by passing an array to the dataset function. In this example we will use the data field from the wine demo dataset.


>> load wine
>> dat   = wine.data;
>> names = wine.label{1};
>> vars  = wine.label{2};
>> whos
  Name        Size            Bytes  Class      Attributes

  dat        10x5               400  double              
  names      10x6               120  char                
  vars        5x6                60  char                
  wine       10x5             12156  dataset  

The variable 'dat' contains the data array for the 5 variables (wine, beer, and liquor consumption, life expectancy, and heart disease rate) measured on 10 samples (countries).

The country names are contained in the variable 'names' and the variable names are contained in 'vars'. The next step creates a DataSet object, gives it a name, author, and description, and attaches both sets of labels.

»wined = dataset(dat);
»wined.name = 'Wine';
»wined.author = 'A.E. Newman';
»wined.description= ...
{'Wine, beer, and liquor consumption (gal/yr)',...
'life expectancy (years), and heart disease rate', ...
'(cases/100,000/yr) for 10 countries.'};
»wined.label{1} = names;
»wined.label{2} = vars;

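Additional assignments can also be made. Here the label for the first mode (rows) is shown explicitly next to the data array (like sample labels). Titles, classes, and axis scales are assigned as well. Note that the label list below is abbreviated (the lone "..." stands for the omitted countries) and the class and axis scale vectors are shortened to match; for the real data each should have one entry per sample (10 here).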

»wined.labelname{1} = 'Countries';
»wined.label{1} = ...
{'France', ...
'Italy', ...
'Switz', ...
'Austra', ...
...
'Mexico'};
»wined.title{1} = 'Country';
»wined.class{1} = [1 1 1 2 3];
»wined.classname{1} = 'Continent';
»wined.axisscale{1} = 1:5;
»wined.axisscalename{1} = 'Country Number';

Additional assignments can also be made for mode 2. Here the label for the second mode (columns) is shown explicitly above the data array (like column headings).

»wined.labelname{2} = 'Variables';
»wined.label{2} = ...
{'Liquor','Wine','Beer','LifeExp','HeartD'};

If the data matrix is N-way, the assignment process can be extended to Mode 3, Mode 4, ... Mode N. It can also be extended to multiple sets of labels and axis scales, for example:

»wined.labelname{2,2} = 'Alcohol Content and Quality';
»wined.label{2,2} = {'high','medium','low','good','bad'};

An individual label can be replaced by further indexing into a given label set using curly braces followed by the string replacement:

»wined.label{2,2}{4} = 'excellent';

Indexing Into DataSets

Sub-portions of the DataSet can be retrieved by indexing into the main DataSet object. For example, here the first three columns ('Liquor', 'Wine', and 'Beer') are extracted into a new DataSet named "alcohol":

»alcohol = wined(:,1:3);
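
The extracted object can be inspected in the same way as the original. The following is a hedged sketch rather than documented behaviour: it assumes PLS_Toolbox is on the MATLAB path and that a sub-DataSet keeps the labels of the columns it retains. It works on the wine demo DataSet directly so that it stands alone, and the variable names raw and kept are illustrative only.

```matlab
load wine                      % demo DataSet shipped with PLS_Toolbox (assumed on the path)
alcohol = wine(:,1:3);         % sub-DataSet holding the first three columns
raw     = alcohol.data;        % plain numeric array behind the DataSet
kept    = alcohol.label{2};    % assumption: column labels travel with the subset
disp(size(raw))                % should report 10 rows and 3 columns
disp(kept)
```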

Any field in the DataSet can also be indexed into directly. Here the second country name is pulled out of the labels by extracting the entire second row of the mode 1 labels:

»country2 = wined.label{1}(2,:);
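
Labels stored as a character array are padded to a common width, so a row pulled out this way may end in blanks. The following plain-MATLAB sketch illustrates this without the toolbox; the three-name array is made up for illustration, and it assumes the labels are a padded char array, as the whos listing earlier suggests.

```matlab
names    = char('France', 'Italy', 'Switz');  % char() pads rows to equal width
row2     = names(2,:);                        % 'Italy ' (one trailing blank)
country2 = strtrim(row2);                     % 'Italy'
fprintf('[%s] -> [%s]\n', row2, country2);    % prints [Italy ] -> [Italy]
```

Trimming before string comparisons avoids mismatches caused by the padding.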

Indexing using Labels and Classes

A shortcut to extract a subset of a DataSet is to index into the main DataSet object using labels and/or classes for the requested item(s). For example, to extract a DataSet containing only the Liquor values, you could use:

»alcohol = wined.liquor;

Note that label matching is not case-sensitive. If the label or class starts with a number or contains any non-alphanumeric characters, you must enclose the label in parentheses and quotes:

»alcohol = wined.('liquor');


Indexing with Class or Label Set Names

Class and label information can be extracted using the "set name" and dot notation.

»mylabels = wine.Country
mylabels =
France
Italy
Switz
Austra
Brit
U.S.A.
Russia
Czech
Japan
Mexico

Note that class names will be checked first, before label names.

Creating 3-Way Data

There are several ways to create 3-way data in PLS_Toolbox.
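
One command-line route, offered as a sketch rather than the documented procedure, is to stack equally sized 2-way arrays along the third dimension with MATLAB's cat and pass the result to dataset, as for 2-way data. The random slabs and the names slab1 to slab3, cube, and cubed are placeholders; PLS_Toolbox is assumed to be on the path, and 'Slab' is a hypothetical label-set name.

```matlab
slab1 = rand(10,5);                  % placeholder data: 10 samples x 5 variables
slab2 = rand(10,5);
slab3 = rand(10,5);
cube  = cat(3, slab1, slab2, slab3); % 10 x 5 x 3 numeric array
cubed = dataset(cube);               % requires PLS_Toolbox
cubed.name = 'Three-way example';
cubed.labelname{3} = 'Slab';         % mode 3 fields are assigned like modes 1 and 2
disp(size(cube))                     % should report 10, 5, and 3
```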

If the data is given in separate text-based files such as .csv, the data can easily be imported into a 3-way dataset using the Text import data tool. By dragging all files into the Workspace Browser and then selecting


3-way.png