Gaselctr: Difference between revisions

Revision as of 20:56, 2 September 2008

Purpose

Genetic algorithm for variable selection with PLS.

Synopsis

model = gaselctr(x,y,options)

[fit,pop,avefit,bstfit] = gaselctr(x,y,options)

options = gaselctr('options')

Description

GASELCTR uses a genetic algorithm to search for the subset of x-block variables that minimizes the cross-validation error of the regression model.

INPUTS

x = the predictor block (x-block), and
y = the predicted block (y-block) (note that all scaling should be done prior to running GASELCTR).

Options

options = a structure array with the following fields:
plots: ['none' | {'intermediate'} | 'replicates' | 'final' ] Governs plots.
'final' gives only a final summary plot.
'replicates' gives plots at the end of each replicate.
'intermediate' gives plots during analysis.
'none' gives no plots.
popsize: {64} the population size (16<=popsize<=256 and popsize must be divisible by 4),
maxgenerations: {100} the maximum number of generations (25<=mg<=500),
mutationrate: {0.005} the mutation rate (typically 0.001<=mt<=0.01),
windowwidth: {1} the number of variables in a window (integer window width),
convergence: {50} percent of population the same at convergence (typically cn=80),
initialterms: {30} percent terms included at initiation (10<=bf<=50),
crossover: {2} breeding cross-over rule (cr = 1: single cross-over; cr = 2: double cross-over),
algorithm: [ 'mlr' | {'pls'} ] regression algorithm,
ncomp: {10} maximum number of latent variables for PLS models,
cv: [ 'rnd' | {'con'} ] cross-validation option ('rnd': random subset cross-validation; 'con': contiguous block subset cross-validation),
split: {5} number of subsets to divide data into for cross-validation,
iter: {1} number of iterations for cross-validation at each generation,
preprocessing: {[] []} a cell containing standard preprocessing structures for the X- and Y-blocks respectively (see PREPROCESS),
preapply: [ {0} | 1 ] If 1, preprocessing is applied to the data prior to the GA. This speeds up the selection, but may reduce the accuracy of the cross-validation results. Output "fit" values should only be compared to each other. A full cross-validation should be run after analysis to get more accurate RMSECV values.
reps: {1} the number of replicate runs to perform,
target: a two-element vector [target_min target_max] describing the target range for the number of variables/terms, n, included in a model. Outside of this range, the penaltyslope option is applied by multiplying the fitness for each member of the population by:
penaltyslope*(target_min-n) when n<target_min, or
penaltyslope*(n-target_max) when n>target_max.
Field target is used to bias models towards a given range of included variables (see penaltyslope below),
targetpct: {1} flag indicating if values in field target are given in percent of variables (1) or in absolute number of variables (0), and
penaltyslope: {0} the slope of the penalty function (see target above).
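
The penalty term described for target and penaltyslope can be sketched in plain MATLAB as follows. This is only an illustration of the two expressions above, not GASELCTR's internal code: penalty_term is a hypothetical helper, returning 0 inside the target range is an assumption, and so is the conversion from percent to a variable count when targetpct is 1.

```matlab
% Sketch of the penalty term described for the target/penaltyslope options.
% penalty_term is a hypothetical helper, not a PLS_Toolbox function.
% Assumption: the term is 0 when n lies inside [target_min, target_max].
penalty_term = @(n, target, penaltyslope) ...
    penaltyslope * max([target(1) - n, n - target(2), 0]);

nvars  = 200;                       % number of x-block variables (illustrative)
target = [5 15];                    % target range in percent (targetpct = 1)
target_abs = target * nvars / 100;  % assumed conversion to counts: [10 30]

p_low  = penalty_term(4,  target_abs, 0.01);  % n < target_min: 0.01*(10-4)
p_in   = penalty_term(20, target_abs, 0.01);  % n inside the target range
p_high = penalty_term(50, target_abs, 0.01);  % n > target_max: 0.01*(50-30)
```

The term grows linearly with the distance from the target range, so a larger penaltyslope pushes the population more strongly toward models with a variable count inside the range.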

The default options can be retrieved using: options = gaselctr('options');
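
As a small sketch of adjusting the options before a run, the snippet below changes popsize and checks it against the documented limits (16 to 256, divisible by 4). To stay runnable without the toolbox it builds a stand-in structure holding only a few of the documented defaults; in practice the structure would come from gaselctr('options'). The valid_popsize check is hypothetical and is not part of GASELCTR.

```matlab
% Stand-in for options = gaselctr('options'); only a few documented fields
% and their defaults are reproduced so the sketch runs without the toolbox.
options = struct('popsize', 64, 'maxgenerations', 100, ...
                 'mutationrate', 0.005, 'initialterms', 30);

% valid_popsize is a hypothetical check, not part of GASELCTR.
valid_popsize = @(p) p >= 16 && p <= 256 && mod(p, 4) == 0;

options.popsize = 128;                     % request a larger population
ok_default = valid_popsize(64);            % documented default
ok_new     = valid_popsize(options.popsize);
ok_bad     = valid_popsize(66);            % in range but not divisible by 4
```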

OUTPUTS

model = a standard GENALG model structure with the following fields:
modeltype: 'GENALG' This field will always have this value,
datasource: {[1x1 struct] [1x1 struct]}, structures defining where the X- and Y-blocks came from,
date: date stamp for when GASELCTR was run,
time: time stamp for when GASELCTR was run,
info: 'Fit results in "rmsecv", population included variables in "icol"', information field describing where the fitness results for each member of the population are contained,
rmsecv: fitness results for each member of the population; for an MxN X-block and Mp unique population members at convergence, rmsecv will be 1xMp,
icol: each row of icol corresponds to the variables used for that member of the population (a 1 [one] means that variable was used and a 0 [zero] means that it was not); for an MxN X-block and Mp unique population members at convergence, icol will be MpxN, and
detail: [1x1 struct], a structure array containing model details including the following fields:
avefit: the average fitness at each generation,
bestfit: the best fitness at each generation, and
options: a structure corresponding to the options discussed above.
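
One way to read the rmsecv and icol fields together is sketched below. The model here is a mock structure with made-up numbers, built only so the snippet runs on its own; a real model would be returned by gaselctr. The variable names best, selected and usage are illustrative.

```matlab
% Mock model with the documented rmsecv (1xMp) and icol (MpxN) fields.
% The numbers are made up for illustration only.
model.rmsecv = [0.42 0.37 0.51];
model.icol   = logical([1 0 1 1 0 0;
                        0 1 1 0 0 1;
                        1 1 0 0 1 0]);

[best_rmsecv, best] = min(model.rmsecv);  % member with the lowest fit value
selected = find(model.icol(best, :));     % column indices of variables it uses

% Fraction of population members that include each variable
usage = mean(model.icol, 1);
```

With a real x-block, x(:,selected) would then give the reduced set of variables for building and fully cross-validating a final model.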

Examples

To use mean centering outside the genetic algorithm (no additional centering will be performed within the algorithm) do the following:

x2 = mncn(x);

y2 = mncn(y);

[fit,pop] = gaselctr(x2,y2);

To use mean centering inside the genetic algorithm (centering will be performed for each cross-validation subset) do the following:

options = gaselctr('options');

options.preprocessing{1} = preprocess('default', 'mean center');

options.preprocessing{2} = preprocess('default', 'mean center');

[fit,pop] = gaselctr(x,y,options);
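
The difference between the two approaches can be illustrated with plain MATLAB, using mean and repmat in place of mncn and a single hand-picked contiguous split. This is a sketch of the general idea under those assumptions, with made-up data, and does not reproduce how GASELCTR forms its subsets.

```matlab
% Made-up data: 6 samples, 2 variables.
x = [1 2; 3 4; 5 6; 7 8; 9 10; 11 12];

% Outside: center once using the mean of all samples.
x_outside = x - repmat(mean(x, 1), size(x, 1), 1);

% Inside: for one contiguous cross-validation split, center with the
% mean of the training rows only and apply that mean to the test rows.
test_rows  = 1:2;
train_rows = 3:6;
mu_train = mean(x(train_rows, :), 1);
x_test_inside = x(test_rows, :) - repmat(mu_train, numel(test_rows), 1);
```

When centering is done outside, the left-out samples have already influenced the mean, which is why the two approaches can report different cross-validation errors.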

@@ Line 7: / Line 7: @@
 ===Description===
 GASELCTR uses a genetic algorithm optimization to minimize cross validation error for variable selection.
-INPUTS:
+====INPUTS====
-* x = the predictor block (x-block), and
+* '''x''' = the predictor block (x-block), and
-* y = the predicted block (y-block) (note that all scaling should be done prior to running GASELCTR).
+* '''y''' = the predicted block (y-block) (note that all scaling should be done prior to running GASELCTR).
 ===Options===
-* ''options'' = a structure array with the following fields:
+* '''''options''''' = a structure array with the following fields:
-* plots: ['none' | {'intermediate'} | 'replicates' | 'final' ] Governs plots.
+* '''plots''': ['none' | {'intermediate'} | 'replicates' | 'final' ] Governs plots.
-*  'final' gives only a final summary plot.
+*  ''''final'''' gives only a final summary plot.
-*  'replicates' gives plots at the end of each replicate.
+*  ''''replicates'''' gives plots at the end of each replicate.
-*  'intermediate' gives plots during analysis.
+*  ''''intermediate'''' gives plots during analysis.
-*  'none' gives no plots.
+*  ''''none'''' gives no plots.
-* popsize: {64} the population size (16<=popsize<=256 and popsize must be divisible by 4),
+* '''popsize''': {64} the population size (16<=popsize<=256 and popsize must be divisible by 4),
-* maxgenerations: {100} the maximum number of generations (25<=mg<=500),
+* '''maxgenerations''': {100} the maximum number of generations (25<=mg<=500),
-* mutationrate: {0.005} the mutation rate (typically 0.001<=mt<=0.01),
+* '''mutationrate''': {0.005} the mutation rate (typically 0.001<=mt<=0.01),
-* windowwidth: {1} the number of variables in a window (integer window width),
+* '''windowwidth''': {1} the number of variables in a window (integer window width),
-* convergence: {50} percent of population the same at convergence (typically cn=80),
+* '''convergence''': {50} percent of population the same at convergence (typically cn=80),
-* initialterms: {30} percent terms included at initiation (10<=bf<=50),
+* '''initialterms''': {30} percent terms included at initiation (10<=bf<=50),
-* crossover: {2} breeding cross-over rule (cr = 1: single cross-over; cr = 2: double cross-over),
+* '''crossover''': {2} breeding cross-over rule (cr = 1: single cross-over; cr = 2: double cross-over),
-* algorithm: [ 'mlr' | {'pls'} ] regression algorithm,
+* '''algorithm''': [ 'mlr' | {'pls'} ] regression algorithm,
-* ncomp: {10} maximum number of latent variables for PLS models,
+* '''ncomp''': {10} maximum number of latent variables for PLS models,
-* cv: [ 'rnd' | {'con'} ] cross-validation option ('rnd': random subset cross-validation; 'con': contiguous block subset cross-validation),
+* '''cv''': [ 'rnd' | {'con'} ] cross-validation option ('rnd': random subset cross-validation; 'con': contiguous block subset cross-validation),
-* split: {5} number of subsets to divide data into for cross-validation,
+* '''split''': {5} number of subsets to divide data into for cross-validation,
-* iter: {1} number of iterations for cross-validation at each generation,
+* '''iter''': {1} number of iterations for cross-validation at each generation,
-* preprocessing: {[] []} a cell containing standard preprocessing structures for the X- and Y-blocks respectively (see PREPROCESS),
+* '''preprocessing''': {[] []} a cell containing standard preprocessing structures for the X- and Y-blocks respectively (see PREPROCESS),
-* preapply: [ {0} | 1 } If 1, preprocessing is applied to data prior to GA. This speeds up the performance of the selection, but my reduce the accuracy of the cross-validation results. Output "fit" values should only be compared to each other. A full cross-validation should be run after analysis to get more accurate RMSECV values.
+* '''preapply''': [ {0} | 1 } If 1, preprocessing is applied to data prior to GA. This speeds up the performance of the selection, but my reduce the accuracy of the cross-validation results. Output "fit" values should only be compared to each other. A full cross-validation should be run after analysis to get more accurate RMSECV values.
-* reps: {1} the number of replicate runs to perform,
+* '''reps''': {1} the number of replicate runs to perform,
-* target: a two element vector [target_min target_max] describing the target range for number of variables/terms included in a model n. Outside of this range, the penaltyslope option is applied by multiplying the fitness for each member of the population by:
+* '''target''': a two element vector [target_min target_max] describing the target range for number of variables/terms included in a model n. Outside of this range, the penaltyslope option is applied by multiplying the fitness for each member of the population by:
-*   penaltyslope*(target_min-n) when n<target_min, or
+*   '''penaltyslope*(target_min-n)''' when n<target_min, or
-*   penaltyslope*(n-target_max) when n>target_max.
+*   '''penaltyslope*(n-target_max)''' when n>target_max.
-*  Field target is used to bias models towards a given range of included variables (see penaltyslope below),
+*  '''Field''' target is used to bias models towards a given range of included variables (see penaltyslope below),
-* targetpct: {1} flag indicating if values in field target are given in percent of variables (1) or in absolute number of variables (0), and
+* '''targetpct''': {1} flag indicating if values in field target are given in percent of variables (1) or in absolute number of variables (0), and
-* penaltyslope: {0} the slope of the penalty function (see target above).
+* '''penaltyslope''': {0} the slope of the penalty function (see target above).
 The default options can be retreived using: options = gaslctr('options');.
-OUTPUT:
+====OUTPUTS====
-* model = a standard GENALG model structure with the following fields:
+* '''model''' = a standard GENALG model structure with the following fields:
-* modeltype: 'GENALG' This field will always have this value,
+* '''modeltype''': 'GENALG' This field will always have this value,
-* datasource: {[1x1 struct] [1x1 struct]}, structures defining where the X- and Y-blocks came from
+* '''datasource''': {[1x1 struct] [1x1 struct]}, structures defining where the X- and Y-blocks came from
-* date: date stamp for when GASELCTR was run,
+* '''date''': date stamp for when GASELCTR was run,
-* time:  time stamp for when GASELCTR was run,
+* '''time''':  time stamp for when GASELCTR was run,
-* info: 'Fit results in "rmsecv", population included variables in "icol"', information field describing where the fitness results for each member of the population are contained,
+* '''info''': 'Fit results in "rmsecv", population included variables in "icol"', information field describing where the fitness results for each member of the population are contained,
-* rmsecv: fitness results for each member of the population, for X ''M''x''N'' and ''Mp'' unique populations at convergence then rmsecv will be ''1xMp'',
+* '''rmsecv''': fitness results for each member of the population, for X ''M''x''N'' and ''Mp'' unique populations at convergence then rmsecv will be ''1xMp'',
-* icol: each row of icol corresponds to the variables used for that member of the population (a 1 [one] means that variable was used and a 0 [zero] means that it was not), for X ''M''x''N'' and ''Mp'' unique populations at convergence then icol will be ''Mp''x''N'', and
+* '''icol''': each row of icol corresponds to the variables used for that member of the population (a 1 [one] means that variable was used and a 0 [zero] means that it was not), for X ''M''x''N'' and ''Mp'' unique populations at convergence then icol will be ''Mp''x''N'', and
-* detail: [1x1 struct], a structure array containing model details including the following fields:
+* '''detail''': [1x1 struct], a structure array containing model details including the following fields:
-*  avefit: the average fitness at each generation,
+*  '''avefit''': the average fitness at each generation,
-*  bestfit: the best fitness at each generation, and
+*  '''bestfit''': the best fitness at each generation, and
-*  options: a structure corresponding to the options discussed above.
+*  '''options''': a structure corresponding to the options discussed above.
 ===Examples===
 To use mean centering outside the genetic algorithm (no additional centering will be performed within the algorithm) do the following:
