Gaselctr: Difference between revisions

From Eigenvector Research Documentation Wiki
imported>Jeremy
(Importing text file)
imported>Mathias
 
(6 intermediate revisions by 3 users not shown)
Line 1: Line 1:
===Purpose===
===Purpose===


Line 6: Line 5:
===Synopsis===
===Synopsis===


:model = gaselctr(x,y,options)  
:model = gaselctr(x,y,''options'')  
:[fit,pop,avefit,bstfit] = gaselctr(x,y,''options'')
:[fit,pop,cavfit,cbfit] = gaselctr(x,y,''options'')


===Description===
===Description===
Line 16: Line 15:


* '''x''' = the predictor block (x-block), and
* '''x''' = the predictor block (x-block), and
* '''y''' = the predicted block (y-block) (note that all scaling should be done prior to running GASELCTR).
* '''y''' = the predicted block (y-block) (note that all scaling should be done prior to running GASELCTR).


===Options===
====Outputs====


* '''''options''''' = a structure array with the following fields:
* '''model''' = a standard GENALG model structure with the following fields:
** '''modeltype''': 'GENALG' This field will always have this value.
** '''datasource''': {[1x1 struct] [1x1 struct]}, structures defining where the X- and Y-blocks came from.
** '''date''': date stamp for when GASELCTR was run.
** '''time''':  time stamp for when GASELCTR was run.
** '''info''': 'Fit results in "rmsecv", population included variables in "icol"', information field describing where the fitness results for each member of the population are contained.
** '''rmsecv''': fitness results for each member of the population, for X ''M''x''N'' and ''Mp'' unique populations at convergence then rmsecv will be ''1xMp''.
** '''icol''': each row of icol corresponds to the variables used for that member of the population (a 1 [one] means that variable was used and a 0 [zero] means that it was not), for X ''M''x''N'' and ''Mp'' unique populations at convergence then icol will be ''Mp''x''N'', and
** '''detail''': [1x1 struct], a structure array containing model details including the following fields:
***  '''avefit''': the average fitness at each generation.
***  '''bestfit''': the best fitness at each generation, and
***  '''options''': a structure corresponding to the options discussed above.


* '''plots''': ['none' | {'intermediate'} | 'replicates' | 'final' ] Governs plots.
For the second output syntax shown above,


* ''''final'''' gives only a final summary plot.
* '''fit''' is the same as <tt>model.rmsecv</tt>
* '''pop''' is the same as <tt>model.icol</tt>
* '''cavfit''' is the same as <tt>model.detail.avefit</tt>
* '''cbfit''' is the same as <tt>model.detail.bestfit</tt>


*  ''''replicates'''' gives plots at the end of each replicate.
===Options===
 
*  ''''intermediate'''' gives plots during analysis.
 
*  ''''none'''' gives no plots.
 
* '''popsize''': {64} the population size (16?popsize?256 and popsize must be divisible by 4),
 
* '''maxgenerations''': {100} the maximum number of generations (25?mg?500),
 
* '''mutationrate''': {0.005} the mutation rate (typically 0.001?mt?0.01),


''options'' is a structure array with the following fields:
* '''plots''': ['none' | {'intermediate'} | 'replicates' | 'final' ] Governs plots.
**  ''''final'''' gives only a final summary plot.
**  ''''replicates'''' gives plots at the end of each replicate.
**  ''''intermediate'''' gives plots during analysis.
**  ''''none'''' gives no plots.
* '''display''': [{'on'}| 'off' ] governs output to the command window.
* '''popsize''': {64} the population size (16<u><</u>popsize<u><</u>256 and popsize must be divisible by 4),
* '''maxgenerations''': {100} the maximum number of generations (25<u><</u>mg<u><</u>500),
* '''mutationrate''': {0.005} the mutation rate (typically 0.001<u><</u>mt<u><</u>0.01),
* '''windowwidth''': {1} the number of variables in a window (integer window width),
* '''windowwidth''': {1} the number of variables in a window (integer window width),
* '''convergence''': {50} percent of population the same at convergence (typically cn=80),
* '''convergence''': {50} percent of population the same at convergence (typically cn=80),
 
* '''initialterms''': {30} percent terms included at initiation (10<u><</u>it<u><</u>50),
* '''initialterms''': {30} percent terms included at initiation (10?bf?50),
 
* '''crossover''': {2} breeding cross-over rule (cr = 1: single cross-over; cr = 2: double cross-over),
* '''crossover''': {2} breeding cross-over rule (cr = 1: single cross-over; cr = 2: double cross-over),
* '''algorithm''': [ 'mlr' | {'pls'} ] regression algorithm,
* '''algorithm''': [ 'mlr' | {'pls'} ] regression algorithm,
* '''ncomp''': {10} maximum number of latent variables for PLS models,
* '''ncomp''': {10} maximum number of latent variables for PLS models,
* '''cv''': [ 'rnd' | {'con'} ] cross-validation option ('rnd': random subset cross-validation; 'con': contiguous block subset cross-validation),
* '''cv''': [ 'rnd' | {'con'} ] cross-validation option ('rnd': random subset cross-validation; 'con': contiguous block subset cross-validation),
* '''split''': {5} number of subsets to divide data into for cross-validation,
* '''split''': {5} number of subsets to divide data into for cross-validation,
* '''iter''': {1} number of iterations for cross-validation at each generation,
* '''iter''': {1} number of iterations for cross-validation at each generation,
 
* '''preprocessing''': {[ ] [ ]} a cell containing standard preprocessing structures for the X- and Y-blocks respectively (see PREPROCESS),
* '''preprocessing''': {[] []} a cell containing standard preprocessing structures for the X- and Y-blocks respectively (see PREPROCESS),
* '''preapply''': [ {0} | 1 } If 1, preprocessing is applied to data prior to GA. This speeds up the performance of the selection, but may reduce the accuracy of the cross-validation results. Output "fit" values should only be compared to each other. A full cross-validation should be run after analysis to get more accurate RMSECV values.
 
* '''preapply''': [ {0} | 1 } If 1, preprocessing is applied to data prior to GA. This speeds up the performance of the selection, but my reduce the accuracy of the cross-validation results. Output "fit" values should only be compared to each other. A full cross-validation should be run after analysis to get more accurate RMSECV values.
 
* '''reps''': {1} the number of replicate runs to perform,
* '''reps''': {1} the number of replicate runs to perform,
* '''target''': a two element vector [target_min target_max] describing the target range for number of variables/terms included in a model n. Outside of this range, the penaltyslope option is applied by multiplying the fitness for each member of the population by:  
* '''target''': a two element vector [target_min target_max] describing the target range for number of variables/terms included in a model n. Outside of this range, the penaltyslope option is applied by multiplying the fitness for each member of the population by:  
 
:   <tt>penaltyslope*(target_min-n)</tt> when n<target_min, or
*   '''penaltyslope\*(target_min-n)''' when n<target_min, or
:   <tt>penaltyslope*(n-target_max)</tt> when n>target_max.
 
: Field <tt>target</tt> is used to bias models towards a given range of included variables (see penaltyslope below),
*   '''penaltyslope\*(n-target_max)''' when n>target_max.
 
* '''Field''' target is used to bias models towards a given range of included variables (see penaltyslope below),
 
* '''targetpct''': {1} flag indicating if values in field target are given in percent of variables (1) or in absolute number of variables (0), and
* '''targetpct''': {1} flag indicating if values in field target are given in percent of variables (1) or in absolute number of variables (0), and
* '''penaltyslope''': {0} the slope of the penalty function (see target above).
* '''penaltyslope''': {0} the slope of the penalty function (see target above).
The default options can be retreived using: options = gaslctr('options');.
====Outputs====
* '''model''' = a standard GENALG model structure with the following fields:
* '''modeltype''': 'GENALG' This field will always have this value,
* '''datasource''': {[1x1 struct] [1x1 struct]}, structures defining where the X- and Y-blocks came from
* '''date''': date stamp for when GASELCTR was run,
* '''time''':  time stamp for when GASELCTR was run,
* '''info''': 'Fit results in "rmsecv", population included variables in "icol"', information field describing where the fitness results for each member of the population are contained,
* '''rmsecv''': fitness results for each member of the population, for X ''M''x''N'' and ''Mp'' unique populations at convergence then rmsecv will be ''1xMp'',
* '''icol''': each row of icol corresponds to the variables used for that member of the population (a 1 [one] means that variable was used and a 0 [zero] means that it was not), for X ''M''x''N'' and ''Mp'' unique populations at convergence then icol will be ''Mp''x''N'', and
* '''detail''': [1x1 struct], a structure array containing model details including the following fields:
*  '''avefit''': the average fitness at each generation,
*  '''bestfit''': the best fitness at each generation, and
*  '''options''': a structure corresponding to the options discussed above.


===Examples===
===Examples===
Line 107: Line 74:
To use mean centering outside the genetic algorithm (no additional centering will be performed within the algorithm) do the following:
To use mean centering outside the genetic algorithm (no additional centering will be performed within the algorithm) do the following:


x2 = mncn(x);
  <pre>
 
  x2 = mncn(x);
:y2 = mncn(y);
  y2 = mncn(y);
 
  [fit,pop] = gaselctr(x2,y2);</pre>
[fit,pop] = gaselctr(x2,y2);


To use mean centering inside the genetic algorithm (centering will be performed for each cross-validation subset) do the following:
To use mean centering inside the genetic algorithm (centering will be performed for each cross-validation subset) do the following:


options = gaselctr('options');
<pre> 
 
  options = gaselctr('options');
:options.preprocessing{1} = preprocess('default', 'mean center');
  options.preprocessing{1} = preprocess('default', 'mean center');
 
  options.preprocessing{2} = preprocess('default', 'mean center');
:options.preprocessing{2} = preprocess('default', 'mean center');
  [fit,pop] = gaselctr(x2,y2,options);
 
</pre>
[fit,pop] = gaselctr(x2,y2,options);
 
===See Also===
===See Also===


[[calibsel]], [[fullsearch]], [[genalg]], [[genalgplot]]
[[calibsel]], [[fullsearch]], [[genalg]], [[genalgplot]], [[ipls]],[[Genetic Algorithms for Variable Selection]]

Latest revision as of 14:58, 7 November 2016

Purpose

Genetic algorithm for variable selection with PLS.

Synopsis

model = gaselctr(x,y,options)
[fit,pop,cavfit,cbfit] = gaselctr(x,y,options)

Description

GASELCTR uses a genetic algorithm to select the subset of variables that minimizes cross-validation error.

Inputs

  • x = the predictor block (x-block), and
  • y = the predicted block (y-block) (note that all scaling should be done prior to running GASELCTR).

Outputs

  • model = a standard GENALG model structure with the following fields:
    • modeltype: 'GENALG' This field will always have this value.
    • datasource: {[1x1 struct] [1x1 struct]}, structures defining where the X- and Y-blocks came from.
    • date: date stamp for when GASELCTR was run.
    • time: time stamp for when GASELCTR was run.
    • info: 'Fit results in "rmsecv", population included variables in "icol"', information field describing where the fitness results for each member of the population are contained.
    • rmsecv: fitness results for each member of the population. For an X-block of size MxN and Mp unique population members at convergence, rmsecv is 1xMp.
    • icol: each row of icol corresponds to the variables used for that member of the population (a 1 [one] means that variable was used and a 0 [zero] means that it was not). For an X-block of size MxN and Mp unique population members at convergence, icol is MpxN, and
    • detail: [1x1 struct], a structure array containing model details including the following fields:
      • avefit: the average fitness at each generation.
      • bestfit: the best fitness at each generation, and
      • options: a structure corresponding to the options described below.

For the second output syntax shown above,

  • fit is the same as model.rmsecv
  • pop is the same as model.icol
  • cavfit is the same as model.detail.avefit
  • cbfit is the same as model.detail.bestfit
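
As a minimal sketch of how these outputs might be used, the snippet below picks the population member with the lowest fitness value and lists the variables it includes. The model structure here is a hand-made stand-in with invented numbers (it is not output from GASELCTR), and the sketch assumes that a lower rmsecv value indicates a better member.

```matlab
% Hand-made stand-in for the structure returned by gaselctr (values invented)
model.rmsecv = [0.42 0.35 0.51];   % 1xMp fitness, one entry per unique member
model.icol   = [1 0 1 0;           % MpxN, 1 = variable used, 0 = not used
                0 1 1 0;
                1 1 0 1];

% Member with the smallest cross-validation error
[bestfit, ibest] = min(model.rmsecv);

% Column indices of the variables used by that member
selected = find(model.icol(ibest, :));
```

With real data, x(:,selected) would then give the reduced X-block for a follow-up model.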

Options

options is a structure array with the following fields:

  • plots: ['none' | {'intermediate'} | 'replicates' | 'final' ] Governs plots.
    • 'final' gives only a final summary plot.
    • 'replicates' gives plots at the end of each replicate.
    • 'intermediate' gives plots during analysis.
    • 'none' gives no plots.
  • display: [{'on'}| 'off' ] governs output to the command window.
  • popsize: {64} the population size (16≤popsize≤256 and popsize must be divisible by 4),
  • maxgenerations: {100} the maximum number of generations (25≤mg≤500),
  • mutationrate: {0.005} the mutation rate (typically 0.001≤mt≤0.01),
  • windowwidth: {1} the number of variables in a window (integer window width),
  • convergence: {50} percent of population the same at convergence (typically cn=80),
  • initialterms: {30} percent terms included at initiation (10≤it≤50),
  • crossover: {2} breeding cross-over rule (cr = 1: single cross-over; cr = 2: double cross-over),
  • algorithm: [ 'mlr' | {'pls'} ] regression algorithm,
  • ncomp: {10} maximum number of latent variables for PLS models,
  • cv: [ 'rnd' | {'con'} ] cross-validation option ('rnd': random subset cross-validation; 'con': contiguous block subset cross-validation),
  • split: {5} number of subsets to divide data into for cross-validation,
  • iter: {1} number of iterations for cross-validation at each generation,
  • preprocessing: {[ ] [ ]} a cell containing standard preprocessing structures for the X- and Y-blocks respectively (see PREPROCESS),
  • preapply: [ {0} | 1 ] If 1, preprocessing is applied to the data prior to the GA. This speeds up the selection, but may reduce the accuracy of the cross-validation results. Output "fit" values should only be compared to each other. A full cross-validation should be run after analysis to get more accurate RMSECV values.
  • reps: {1} the number of replicate runs to perform,
  • target: a two-element vector [target_min target_max] describing the target range for the number of variables/terms, n, included in a model. Outside of this range, the penaltyslope option is applied by multiplying the fitness for each member of the population by:
      penaltyslope*(target_min-n) when n<target_min, or
      penaltyslope*(n-target_max) when n>target_max.
    Field target is used to bias models towards a given range of included variables (see penaltyslope below),
  • targetpct: {1} flag indicating if values in field target are given in percent of variables (1) or in absolute number of variables (0), and
  • penaltyslope: {0} the slope of the penalty function (see target above).
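
The two penalty expressions in the target description can be evaluated as below. This only computes the documented term for a given n; how GASELCTR combines it with the fitness internally is not shown here. The helper name penaltyterm, the zero value inside the range, and the example numbers are all assumptions made for illustration.

```matlab
% Hypothetical helper: the documented penalty term for n included variables.
% Gives slope*(tmin-n) when n<tmin and slope*(n-tmax) when n>tmax.
% Returning 0 inside the range is an assumption; the text only defines
% the term outside the range.
penaltyterm = @(n, tmin, tmax, slope) slope .* (max(tmin - n, 0) + max(n - tmax, 0));

target       = [10 30];   % [target_min target_max], example values
penaltyslope = 0.01;      % example slope (the default is 0)

low    = penaltyterm( 5, target(1), target(2), penaltyslope);  % n below the range
inside = penaltyterm(20, target(1), target(2), penaltyslope);  % n within the range
high   = penaltyterm(45, target(1), target(2), penaltyslope);  % n above the range
```

When targetpct is 1, the values in target are percentages of the variables, so n would presumably need to be expressed on the same scale before the comparison.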

Examples

To use mean centering outside the genetic algorithm (no additional centering will be performed within the algorithm) do the following:

  x2 = mncn(x);
  y2 = mncn(y);
  [fit,pop] = gaselctr(x2,y2);

To use mean centering inside the genetic algorithm (centering will be performed for each cross-validation subset) do the following:

  
  options = gaselctr('options');
  options.preprocessing{1} = preprocess('default', 'mean center');
  options.preprocessing{2} = preprocess('default', 'mean center');
  [fit,pop] = gaselctr(x,y,options);
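
Several options can be combined in the same way. The following sketch assumes x and y are already loaded and that gaselctr is on the MATLAB path. The specific values are arbitrary illustrations chosen within the ranges listed under Options, not recommendations.

```matlab
options = gaselctr('options');     % start from the defaults
options.plots          = 'final';  % only a final summary plot
options.display        = 'off';    % no command-window output
options.popsize        = 32;       % within 16..256 and divisible by 4
options.maxgenerations = 50;       % within 25..500
options.cv             = 'rnd';    % random-subset cross-validation
options.split          = 5;        % number of cross-validation subsets
options.reps           = 3;        % three replicate runs
model = gaselctr(x,y,options);     % single-output form returns the model structure
```

The single-output form is used here so that the settings travel with the results in model.detail.options.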

See Also

calibsel, fullsearch, genalg, genalgplot, ipls, Genetic Algorithms for Variable Selection