Gaselctr: Difference between revisions

Latest revision as of 14:58, 7 November 2016

Purpose

Genetic algorithm for variable selection with PLS.

Synopsis

model = gaselctr(x,y,options)

[fit,pop,cavfit,cbfit] = gaselctr(x,y,options)

Description

GASELCTR uses a genetic algorithm to select the subset of x-block variables that minimizes the cross-validation error of the regression model.

Inputs

x = the predictor block (x-block), and
y = the predicted block (y-block) (note that all scaling should be done prior to running GASELCTR).

Outputs

model = a standard GENALG model structure with the following fields:
- modeltype: 'GENALG' This field will always have this value.
- datasource: {[1x1 struct] [1x1 struct]}, structures defining where the X- and Y-blocks came from.
- date: date stamp for when GASELCTR was run.
- time: time stamp for when GASELCTR was run.
- info: 'Fit results in "rmsecv", population included variables in "icol"', information field describing where the fitness results for each member of the population are contained.
- rmsecv: fitness results for each member of the population. For an X-block of size MxN with Mp unique populations at convergence, rmsecv will be 1xMp.
- icol: each row of icol corresponds to the variables used for that member of the population (a 1 [one] means that variable was used and a 0 [zero] means that it was not). For an X-block of size MxN with Mp unique populations at convergence, icol will be MpxN, and
- detail: [1x1 struct], a structure array containing model details including the following fields:
  - avefit: the average fitness at each generation.
  - bestfit: the best fitness at each generation, and
  - options: a structure corresponding to the options discussed below.

For the second output syntax shown above,

fit is the same as model.rmsecv
pop is the same as model.icol
cavfit is the same as model.detail.avefit
cbfit is the same as model.detail.bestfit
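
The correspondence between the two output syntaxes can be sketched as follows. This is a minimal illustration, assuming x and y are already loaded in the workspace and that a lower fitness value is better (the fitness is described as an RMSECV that GASELCTR minimizes). The variable names best_fit, idx, and selected are illustrative only.

```matlab
% Single-output syntax: everything is returned in the model structure.
model = gaselctr(x,y);

fit    = model.rmsecv;          % 1xMp fitness for each unique population member
pop    = model.icol;            % MpxN matrix of 0/1 variable flags
cavfit = model.detail.avefit;   % average fitness at each generation
cbfit  = model.detail.bestfit;  % best fitness at each generation

% Pick the member with the lowest fitness (assumption: lower is better)
% and list the x-block columns it includes.
[best_fit, idx] = min(fit);
selected = find(pop(idx,:));    % column indices flagged with a 1
```

The list in selected can then be used to index the x-block, for example x(:,selected), when building a follow-up model.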

Options

options is a structure array with the following fields:

plots: ['none' | {'intermediate'} | 'replicates' | 'final' ] Governs plots.
- 'final' gives only a final summary plot.
- 'replicates' gives plots at the end of each replicate.
- 'intermediate' gives plots during analysis.
- 'none' gives no plots.
display: [{'on'}| 'off' ] governs output to the command window.
popsize: {64} the population size (16≤popsize≤256 and popsize must be divisible by 4),
maxgenerations: {100} the maximum number of generations (25≤mg≤500),
mutationrate: {0.005} the mutation rate (typically 0.001≤mt≤0.01),
windowwidth: {1} the number of variables in a window (integer window width),
convergence: {50} percent of population the same at convergence (typically cn=80),
initialterms: {30} percent terms included at initiation (10≤it≤50),
crossover: {2} breeding cross-over rule (cr = 1: single cross-over; cr = 2: double cross-over),
algorithm: [ 'mlr' | {'pls'} ] regression algorithm,
ncomp: {10} maximum number of latent variables for PLS models,
cv: [ 'rnd' | {'con'} ] cross-validation option ('rnd': random subset cross-validation; 'con': contiguous block subset cross-validation),
split: {5} number of subsets to divide data into for cross-validation,
iter: {1} number of iterations for cross-validation at each generation,
preprocessing: {[ ] [ ]} a cell containing standard preprocessing structures for the X- and Y-blocks respectively (see PREPROCESS),
preapply: [ {0} | 1 ] If 1, preprocessing is applied to the data prior to the GA. This speeds up the selection, but may reduce the accuracy of the cross-validation results. Output "fit" values should only be compared to each other. A full cross-validation should be run after analysis to get more accurate RMSECV values.
reps: {1} the number of replicate runs to perform,
target: a two-element vector [target_min target_max] describing the target range for the number n of variables/terms included in a model. Outside of this range, the penaltyslope option is applied by multiplying the fitness for each member of the population by:

penaltyslope*(target_min-n) when n<target_min, or

penaltyslope*(n-target_max) when n>target_max.

Field target is used to bias models towards a given range of included variables (see penaltyslope below),

targetpct: {1} flag indicating if values in field target are given in percent of variables (1) or in absolute number of variables (0), and
penaltyslope: {0} the slope of the penalty function (see target above).
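
The penalty expressions given for the target option can be evaluated by hand as a rough check. The sketch below uses only core MATLAB and a made-up three-member population. All variable names are illustrative and are not GASELCTR internals, target is taken as absolute counts (as with targetpct = 0), and the slope value is an arbitrary example. How the resulting term is combined with the fitness inside GASELCTR is not shown here.

```matlab
% Hypothetical settings (illustrative values only)
target       = [2 4];    % [target_min target_max] in absolute variable counts
penaltyslope = 0.01;     % example slope; the documented default is 0

% Hypothetical population: 3 members over 6 variables (1 = variable used)
pop = [1 0 0 0 0 0;
       1 1 0 1 0 0;
       1 1 1 1 1 1];

n = sum(pop,2);          % number of included variables for each member

% Evaluate the documented expressions outside the target range.
% Members inside the range receive no penalty, so their term is left at 0.
term  = zeros(size(n));
below = n < target(1);
above = n > target(2);
term(below) = penaltyslope*(target(1) - n(below));
term(above) = penaltyslope*(n(above) - target(2));
```

The term grows linearly with the distance from the target range, which is what biases the search toward models with a variable count inside [target_min target_max].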

Examples

To use mean centering outside the genetic algorithm (no additional centering will be performed within the algorithm) do the following:

  x2 = mncn(x);
  y2 = mncn(y);
  [fit,pop] = gaselctr(x2,y2);

To use mean centering inside the genetic algorithm (centering will be performed for each cross-validation subset) do the following:

  
  options = gaselctr('options');
  options.preprocessing{1} = preprocess('default', 'mean center');
  options.preprocessing{2} = preprocess('default', 'mean center');
  [fit,pop] = gaselctr(x,y,options);
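
Several of the option fields described above can be combined in one run. The following is a sketch only: it uses the documented field names, but the specific values (population size, number of replicates, target range, and slope) are arbitrary choices for illustration, and x and y are assumed to be in the workspace.

```matlab
options = gaselctr('options');   % start from the default options structure

options.plots        = 'none';   % suppress plots
options.display      = 'off';    % suppress command-window output
options.popsize      = 32;       % must be divisible by 4
options.cv           = 'rnd';    % random subset cross-validation
options.split        = 5;        % number of cross-validation subsets
options.reps         = 3;        % replicate runs
options.targetpct    = 0;        % interpret target as absolute counts
options.target       = [5 20];   % preferred range for number of variables
options.penaltyslope = 0.01;     % illustrative value; the default of 0 disables the penalty

model = gaselctr(x,y,options);
```

Because the GA fitness values are meant to be compared only with each other when preapply is used, a full cross-validation on the chosen variables is still advisable afterward.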

See Also

calibsel, fullsearch, genalg, genalgplot, ipls, Genetic Algorithms for Variable Selection
