DataSet Object Examples
Latest revision as of 10:20, 18 April 2017
DataSet Object Tour
Perhaps the best way to understand how DSOs work is to examine a couple of them. Several data sets are included with PLS_Toolbox, and all of them are in the form of DSOs. We will start with the smallest one, the Wine data set. Clear the MATLAB workspace (save anything important first!) and at the command line type:
>> load wine
>> whos
  Name      Size      Bytes  Class
  wine      10x5       6050  dataset object
Grand total is 920 elements using 6050 bytes
We have now loaded the Wine data set. When the whos command is used, we see that there is a single variable in the workspace, wine, of Class dataset object, with a data field that is 10 by 5. We can look at the contents of wine by typing:
>> wine
wine =
       name: Wine
       type: data
     author: B.M. Wise
       date: 14-May-2001 13:47:54
    moddate: 06-Jun-2001 10:27:24
       data: 10x5 [double]
      label: {2x1} [array (char)]
              Mode 1 [Country: 10x6]
              Mode 2 [Variable: 5x6]
  axisscale: {2x1} [vector (real)]
              Mode 1 [: ]
              Mode 2 [: ]
      title: {2x1} [vector (char)]
              Mode 1 [: 'Country']
              Mode 2 [: 'Variable']
      class: {2x1} [vector (integer)]
              Mode 1 [: ]
              Mode 2 [: ]
    include: {2x1} [vector (integer)]
              Mode 1 [: 1x10]
              Mode 2 [: 1x5]
description: Wine, beer, and liquor consumption (gal/yr),
             life expectancy (years), and heart disease rate
             (cases/100,000/yr) for 10 countries.
    history: {1x1 cell} [array (char)]
   userdata:
From this we see that the name of the data is Wine and that the type is “data.” Other types are also possible, such as “image” and “batch.” The author is listed, followed by the creation date and last-modified date. The next field, data, contains the actual data table. These data, or the data from any DSO field, can be extracted from the DSO just as they would be from a conventional structure array (type help struct or refer to the Examining a Structure Array section of Chapter 2 for help) using DSOname.fieldname syntax. For instance:
>> wine.data
ans =
    2.5000   63.5000   40.1000   78.0000   61.1000
    0.9000   58.0000   25.1000   78.0000   94.1000
    1.7000   46.0000   65.0000   78.0000  106.4000
    1.2000   15.7000  102.1000   78.0000  173.0000
    1.5000   12.2000  100.0000   77.0000  199.7000
    2.0000    8.9000   87.8000   76.0000  176.0000
    3.8000    2.7000   17.1000   69.0000  373.6000
    1.0000    1.7000  140.0000   73.0000  283.7000
    2.1000    1.0000   55.0000   79.0000   34.7000
    0.8000    0.2000   50.4000   73.0000   36.4000
The labels can be extracted in a similar manner:
>> wine.label
ans =
    [10x6] char
    [ 5x6] char
Note that ans is a cell array, i.e., the labels for each mode of the array are stored in a cell that is indexed to that mode. Thus, the labels for mode 1, the data set rows, can be extracted with:
>> wine.label{1}
ans =
France
Italy
Switz
Austra
Brit
U.S.A.
Russia
Czech
Japan
Mexico
Note that curly brackets, {}, are used to index into cell arrays (type help cell for more information on cell arrays). In a similar way the labels for mode 2, the data set columns, can be extracted by executing:
>> wine.label{2}
ans =
Liquor
Wine
Beer
LifeEx
HeartD
Other fields in the DSO include .axisscale (e.g., time or wavelength scale), .title (titles for the axes), and .class (e.g., class variables for samples). Note that a typical data set will not have all of the available fields filled. The Wine data set does not have axis scales, for instance, nor class variables.
DSOs also allow for multiple sets of many of these fields; for instance, you may store more than one set of labels for a particular mode. Most GUI tools, including Analysis, PlotGUI, and the DataSet Editor, support multiple sets, although support is incomplete in a few cases; for example, multiple sets of axis scales and titles are not yet fully supported in the main Analysis GUI.
The user is encouraged to explore more of the DSOs included with PLS_Toolbox. For an example with axis scales, please see spec1 in the nir_data.mat file. For an example with class variables, please see arch in the arch.mat file.
Creating A DataSet Object
The following shows an example using the 'wine' data set in the PLS_Toolbox. Other examples can be found in the datasetdemo.m script.
The first step in the example is to create raw data from the 'wine' data set and examine the variables. The MATLAB commands are:
>> load wine
>> dat = wine.data;
>> names = wine.label{1};
>> vars = wine.label{2};
>> whos
  Name        Size            Bytes  Class      Attributes
  dat        10x5               400  double
  names      10x6               120  char
  vars        5x6                60  char
  wine       10x5             12156  dataset
The variable 'dat' contains the data array: 10 samples (countries) by 5 variables (liquor, wine, and beer consumption, life expectancy, and heart disease rate).
The country names are contained in the variable 'names' and the variable names are contained in 'vars'. The next step creates a DataSet object and gives it a name, an author, a description, and labels.
»wined = dataset(dat);
»wined.name = 'Wine';
»wined.author = 'A.E. Newman';
»wined.description = ...
   {'Wine, beer, and liquor consumption (gal/yr)',...
    'life expectancy (years), and heart disease rate', ...
    '(cases/100,000/yr) for 10 countries.'};
»wined.label{1} = names;
»wined.label{2} = vars;
Additional assignments can also be made. Here the label for the first mode (rows) is shown explicitly next to the data array (like sample labels). A title, classes, and an axis scale are also assigned.
»wined.labelname{1} = 'Countries';
»wined.label{1} = ...
   {'France' ...
    'Italy', ...
    'Switz', ...
    'Austra', ...
    ...
    'Mexico'};
»wined.title{1} = 'Country';
»wined.class{1} = [1 1 1 2 3];
»wined.classname{1} = 'Continent';
»wined.axisscale{1} = 1:5;
»wined.axisscalename{1} = 'Country Number';
Additional assignments can also be made for mode 2. Here the label for the second mode (columns) is shown explicitly above the data array (like column headings). Titles, classes, and axis scales can be assigned for mode 2 in the same way as for mode 1.
»wined.labelname{2} = 'Variables';
»wined.label{2} = ...
   {'Liquor','Wine','Beer','LifeExp','HeartD'};
If the data matrix is N-way, the assignment process can be extended to Mode 3, Mode 4, ... Mode N. It can also be extended to multiple sets of labels and axis scales by adding a second (set) index, e.g.:
»wined.labelname{2,2} = 'Alcohol Content and Quality';
»wined.label{2,2} = {'high','medium','low','good','bad'};
An individual label can be replaced by further indexing into a given label set using curly braces followed by the string replacement:
»wined.label{2,2}{4} = 'excellent';
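The {mode,set} indexing pattern can be tried in plain MATLAB without the toolbox. The sketch below uses an ordinary 2x2 cell array named label as a stand-in for the .label field; this is only an analogy for the indexing syntax and is not a description of how the DataSet object stores labels internally.

```matlab
% Plain-MATLAB stand-in for the .label field (an assumption for illustration):
% rows index the mode, columns index the label set.
label = cell(2, 2);
label{2,1} = {'Liquor','Wine','Beer','LifeExp','HeartD'};   % mode 2, set 1
label{2,2} = {'high','medium','low','good','bad'};          % mode 2, set 2

% Replace the fourth label of mode 2, set 2.
label{2,2}{4} = 'excellent';

disp(label{2,2});
```

The first pair of braces selects the label set for a mode, and the second pair selects one label within that set.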
Indexing Into DataSets
Sub-portions of the DataSet can be retrieved by indexing into the main DataSet object. For example, here the first three columns ('Liquor', 'Wine', and 'Beer') are extracted into a new DataSet named "alcohol":
»alcohol = wined(:,1:3);
Any field in the DataSet can also be indexed into directly. Here the second country name is pulled out of the labels by extracting the entire second row of the mode 1 labels:
»country2 = wined.label{1}(2,:);
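Because the labels shown earlier are stored as a padded character array (10x6 for the countries), a row extracted this way keeps its trailing blanks. The following plain-MATLAB sketch rebuilds the country names with char, which pads to a common width, and trims one row; the variable names padded and trimmed are illustrative only.

```matlab
% Rebuild the mode 1 labels as a padded char array (no toolbox needed).
names = char('France','Italy','Switz','Austra','Brit', ...
             'U.S.A.','Russia','Czech','Japan','Mexico');

padded  = names(2,:);        % entire second row, including trailing blanks
trimmed = strtrim(padded);   % remove the padding

fprintf('[%s] has %d characters, [%s] has %d\n', ...
        padded, numel(padded), trimmed, numel(trimmed));
```

Trimming matters when the extracted label is compared against another string with strcmp, since the padded and unpadded forms are not equal.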
Indexing using Labels and Classes
A shortcut to extract a subset of a DataSet is to index into the main DataSet object using labels and/or classes for the requested item(s). For example, to extract a DataSet containing only the Liquor values, you could use:
»alcohol = wined.liquor;
Note that label matching is not case-sensitive. If the label or class starts with a number or contains any non-alphanumeric characters, you must enclose the label in parentheses and quotes:
»alcohol = wined.('liquor');
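As a rough illustration of what a case-insensitive label lookup does, the plain-MATLAB sketch below finds a column by label with strcmpi and extracts it from a raw matrix. It uses the first two rows of the wine data shown earlier; it is an analogy for the shortcut above, not the toolbox's actual implementation.

```matlab
% Column labels and the first two rows of the wine data (from the tour above).
labels = {'Liquor','Wine','Beer','LifeEx','HeartD'};
dat    = [2.5 63.5 40.1 78.0 61.1;
          0.9 58.0 25.1 78.0 94.1];

% Case-insensitive match of the requested label against the label list.
col    = find(strcmpi(labels, 'liquor'));
liquor = dat(:, col);

disp(liquor);
```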
Indexing with Class or Label Set Names
Class and label information can be extracted using the "set name" and dot notation.
mylabels = wine.Country
mylabels =
France
Italy
Switz
Austra
Brit
U.S.A.
Russia
Czech
Japan
Mexico
Note that class names will be checked first, before label names.
Using the Class Lookup Table
Classes in a DataSet are stored as numeric values (in the .class field). Each numeric value can be associated with a text value using the classlookup field. This field contains a simple n x 2 table with numeric values in the first column and string values in the second. For example, the arch dataset has a classlookup table for elements:
>> load arch
>> arch.classlookup{1,1}
ans =
    [0]    'Class 0'
    [1]    'K'
    [2]    'BL'
    [3]    'SH'
    [4]    'AN'
There are two basic ways to alter a classlookup table: extracting the table, editing it, and writing it back, or accessing the table directly.
Extracting the Table
Extracting the table can be useful if you need to perform several changes:
a = arch.classlookup{1};
a{3,2} = 'YYY';
a{4,2} = 'ZZZ';
arch.classlookup{1} = a;
arch.classlookup{1}
ans =
    [0]    'Class 0'
    [1]    'K'
    [2]    'YYY'
    [3]    'ZZZ'
    [4]    'AN'
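The extracted table is an ordinary n x 2 cell array, so the same edits can be rehearsed in plain MATLAB. The sketch below builds a table matching the one displayed for arch, applies the two changes, and then maps a vector of numeric classes to their names. The classes vector is hypothetical sample data, and the ismember-based mapping is one possible way to use such a table, not a toolbox function.

```matlab
% Stand-in for the table returned by arch.classlookup{1}.
lookup = {0 'Class 0'; 1 'K'; 2 'BL'; 3 'SH'; 4 'AN'};

% Edit rows 3 and 4 of the string column (classes 2 and 3).
lookup{3,2} = 'YYY';
lookup{4,2} = 'ZZZ';

% Hypothetical numeric classes for six samples.
classes = [1 2 2 4 0 3];

% Find the table row for each class value, then read off the names.
[found, row] = ismember(classes, cell2mat(lookup(:,1)));
names = lookup(row, 2)';

disp(names);
```

Note that the row index into the table (3 and 4) differs from the numeric class value (2 and 3) because the table starts at class 0.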
Direct Access
If you just need to change a single value (or a small number of values), you can directly access the lookup table using the .assignstr and .assignval fields. For example, to change class 0 of arch from "Class 0" to "Unknown":
>> arch.classlookup{1}.assignstr = {0 'Unknown'};
>> arch.classlookup{1}
ans =
    [0]    'Unknown'
    [1]    'K'
    [2]    'BL'
    [3]    'SH'
    [4]    'AN'
Then, to change the numeric value of "Unknown" from 0 to 6:
>> arch.classlookup{1}.assignval = {6 'Unknown'};
>> arch.classlookup{1}
ans =
    [0]    'Class 0'
    [1]    'K'
    [2]    'BL'
    [3]    'SH'
    [4]    'AN'
    [6]    'Unknown'
Extracting a Class
In releases after version 8.2 of PLS_Toolbox/Solo, subsets can be extracted based on class names:
>> mysubset = mydataset('myclass');
Here, mydataset is a DataSet containing samples that belong to the class named myclass, and mysubset is a new DataSet containing only those samples of mydataset that are members of myclass.
To extract only the samples of that class that are also marked as included, add .include:
>> subset = mydataset('myclass').include;
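The difference between the two forms can be pictured with ordinary index arithmetic. The plain-MATLAB sketch below uses a hypothetical data matrix, class vector, and include list (none of them from the toolbox demos) to select all members of one class and then only the included members; it illustrates the selection logic and is not the toolbox's implementation.

```matlab
% Hypothetical data: 6 samples by 6 variables.
data    = magic(6);
cls     = [1 1 2 2 3 3];    % numeric class of each sample
include = [1 2 3 5 6];      % sample 4 has been excluded

% All samples belonging to class 2.
inClass   = find(cls == 2);
subsetAll = data(inClass, :);

% Only the class 2 samples that are also in the include list.
subsetIncl = data(intersect(inClass, include), :);

fprintf('%d samples in class, %d of them included\n', ...
        size(subsetAll, 1), size(subsetIncl, 1));
```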
Using Image DataSets
The DataSet Object contains functionality for handling image data. In image DataSet objects, the .data field contains "unfolded" image data. Image data is usually contained in a 2nd or higher-order matrix in which several modes are used to describe a spatial relationship between pieces of information. For example, many standard JPEG images are three-way images of size M x N x 3. The first two dimensions are the spatial dimensions in that the actual image is M pixels high by N pixels wide. The third dimension is the wavelength dimension and, in this case, contains 3 slabs – one for each of the Red, Green, Blue image components.
Working with such image data in DataSet objects is made easier by unfolding such multi-way images so that all the spatial modes are stacked on top of each other in a single mode. Unfolding is done so that all the spatial information is contained in a single mode and can be handled together – often so that each pixel can be analyzed as an individual sample (or even sometimes as variables). Individual pixels in an unfolded image are independent and can be individually included and excluded (see .include field) or assigned particular classes (see .class field), for example. In the case of the JPEG mentioned above, the unfolded image would be stored as an (MN) x 3 matrix where the first mode was M times N elements (i.e. pixels) in size.
The .imagemode field contains a scalar value indicating which mode of the .data field contains the spatial information. In the JPEG example, .imagemode would be 1 (one). Similarly, the .imagesize field contains a vector describing the original size of the spatial mode before unfolding; in the JPEG example it would be the two-element vector [M N]. Note that .imagesize contains only the size of the image mode, not of the entire data matrix. The product of the elements of .imagesize must equal the size of the .imagemode mode of the .data field; that is, the number of pixels in the spatial mode must allow it to be reshaped into an array of size .imagesize. See the .foldedsize field for the size of the entire folded matrix.
The .imagedata field is a special read-only field which returns the contents of the .data field refolded back into the original image-sized matrix. In the JPEG example, the .data field would return a matrix of size (MN) x 3 (the unfolded image) but the .imagedata field would return the original M x N x 3 matrix. Any changes you make to the contents of the .data field will automatically be reflected in the contents returned by .imagedata. The .imagedata field itself, however, cannot be written to.
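Unfolding and refolding can be sketched in plain MATLAB with reshape. The example below assumes the spatial modes are unfolded in MATLAB's column-major order, which is what reshape does; whether the toolbox uses exactly this ordering is an assumption here. The small M x N x 3 array and the variable names are illustrative only.

```matlab
% A small M x N x 3 "image" with distinct values so positions can be traced.
M = 4; N = 3;
img = reshape(1:M*N*3, M, N, 3);

% Unfold: stack the two spatial modes into one, giving an (M*N) x 3 matrix.
unfolded = reshape(img, M*N, 3);

% Refold back to the original image size (the analogue of [M N] in .imagesize).
refolded = reshape(unfolded, M, N, 3);

% The pixel at image position (r, c) becomes a single row of the unfolded matrix.
r = 2; c = 3;
row   = sub2ind([M N], r, c);
pixel = unfolded(row, :);

fprintf('pixel (%d,%d) is row %d of the unfolded matrix\n', r, c, row);
```

Each row of the unfolded matrix holds the three channel values of one pixel, which is why pixels can be treated as individual samples.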
The following example uses functions found in MIA_Toolbox.
Build an image dataset from the demonstration data "EchoRidgeClouds.jpeg" included with MIA_Toolbox.
>> dat = imread('EchoRidgeClouds.jpeg','jpeg');
>> size(dat)
ans =
   768   512     3
>> imgdso = buildimage(dat, [1 2], 1)
imgdso =
       name:
       type: image
     author:
       date: 08-Sep-2009 21:21:13
    moddate: 08-Sep-2009 21:21:13
       data: 393216x3 [double]
  imagesize: 768x512
  imagemode: 1
      label: {2x1} [array (char)]
              Mode 1 [: ]
              Mode 2 [: ]
  axisscale: {2x1} [vector (real)]
              Mode 1 [: ] (none)
              Mode 2 [: ] (none)
      title: {2x1} [vector (char)]
              Mode 1 [: ]
              Mode 2 [: ]
      class: {2x1} [vector (double)]
              Mode 1 [: ]
              Mode 2 [: ]
    classid: {2x1} [cell of strings]
    include: {2x1} [vector (integer)]
              Mode 1 [: 1x393216]
              Mode 2 [: 1x3]
description:
    history: {1x1 cell} [array (char)]
   userdata:
Now we can look at the size of data returned by each field:
>> mydata = imgdso.data;
>> myimage = imgdso.imagedata;
>> whos
  Name           Size                 Bytes  Class      Attributes
  dat          768x512x3            1179648  uint8
  imgdso    393216x3               12591234  dataset
  mydata    393216x3                9437184  double
  myimage      768x512x3            9437184  double
Notice that the image data is returned as double. If we were to use the imagesc function in MATLAB, we'd need to convert the data to uint8 before plotting. There are several image reading functions that return data in an image type DSO. When plotting image DSOs, plotgui will default to using the image (folded data) when appropriate.
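The conversion mentioned above is a single cast. The plain-MATLAB sketch below uses a small random array as a stand-in for the double-valued data returned by .imagedata, assuming (as for data read from an 8-bit JPEG) that the values are integers in the range 0 to 255.

```matlab
% Stand-in for imgdso.imagedata: double values in the 0..255 range.
myimage = double(randi([0 255], 4, 3, 3));

% Cast back to uint8 so the array is treated as 8-bit RGB when displayed.
img8 = uint8(myimage);

% To display (requires a graphics environment):
% imagesc(img8); axis image

disp(class(img8));
```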