Gaselctr: Difference between revisions
imported>Jeremy (Importing text file) |
Revision as of 15:25, 3 September 2008
Purpose
Genetic algorithm for variable selection with PLS.
Synopsis
- model = gaselctr(x,y,options)
- [fit,pop,avefit,bstfit] = gaselctr(x,y,options)
- options = gaselctr('options')
Description
GASELCTR uses genetic algorithm optimization to select variables by minimizing cross-validation error.
INPUTS
- x = the predictor block (x-block), and
- y = the predicted block (y-block) (note that all scaling should be done prior to running GASELCTR).
Options
- options = a structure array with the following fields:
- plots: ['none' | {'intermediate'} | 'replicates' | 'final' ] Governs plots.
- 'final' gives only a final summary plot.
- 'replicates' gives plots at the end of each replicate.
- 'intermediate' gives plots during analysis.
- 'none' gives no plots.
- popsize: {64} the population size (16 ≤ popsize ≤ 256 and popsize must be divisible by 4),
- maxgenerations: {100} the maximum number of generations (25 ≤ mg ≤ 500),
- mutationrate: {0.005} the mutation rate (typically 0.001 ≤ mt ≤ 0.01),
- windowwidth: {1} the number of variables in a window (integer window width),
- convergence: {50} percent of population the same at convergence (typically cn=80),
- initialterms: {30} percent terms included at initiation (10 ≤ bf ≤ 50),
- crossover: {2} breeding cross-over rule (cr = 1: single cross-over; cr = 2: double cross-over),
- algorithm: [ 'mlr' | {'pls'} ] regression algorithm,
- ncomp: {10} maximum number of latent variables for PLS models,
- cv: [ 'rnd' | {'con'} ] cross-validation option ('rnd': random subset cross-validation; 'con': contiguous block subset cross-validation),
- split: {5} number of subsets to divide data into for cross-validation,
- iter: {1} number of iterations for cross-validation at each generation,
- preprocessing: {[] []} a cell containing standard preprocessing structures for the X- and Y-blocks respectively (see PREPROCESS),
- preapply: [ {0} | 1 ] If 1, preprocessing is applied to the data prior to the GA. This speeds up the selection, but may reduce the accuracy of the cross-validation results. Output "fit" values should only be compared to each other. A full cross-validation should be run after analysis to get more accurate RMSECV values.
- reps: {1} the number of replicate runs to perform,
- target: a two-element vector [target_min target_max] describing the target range for the number of variables/terms, n, included in a model. Outside of this range, the penaltyslope option is applied by multiplying the fitness for each member of the population by:
- penaltyslope*(target_min-n) when n<target_min, or
- penaltyslope*(n-target_max) when n>target_max.
- Field target is used to bias models towards a given range of included variables (see penaltyslope below),
- targetpct: {1} flag indicating if values in field target are given in percent of variables (1) or in absolute number of variables (0), and
- penaltyslope: {0} the slope of the penalty function (see target above).
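As a rough illustration of the target and penaltyslope fields, the sketch below evaluates the two penalty expressions listed above for a few example population members. It uses only base MATLAB; the values of target, penaltyslope, and n are made up for the example, target is assumed to be given in absolute counts (targetpct = 0), and the variable factor is a hypothetical name. How GASELCTR applies these terms to the fitness internally is not reproduced here.

```matlab
% Illustrative values only; none of these are GASELCTR defaults
target       = [10 20];   % [target_min target_max], absolute counts (targetpct = 0)
penaltyslope = 0.01;      % example slope; the documented default is 0
n            = [5 15 30]; % number of included variables for three example members

% Penalty term for each member; members inside the target range keep 0
factor = zeros(size(n));
below  = n < target(1);
above  = n > target(2);
factor(below) = penaltyslope * (target(1) - n(below)); % n < target_min
factor(above) = penaltyslope * (n(above) - target(2)); % n > target_max

disp(factor)
```

The penalty term grows linearly with the distance from the target range, so a larger penaltyslope biases the search more strongly towards models with between target_min and target_max variables.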
The default options can be retrieved using: options = gaselctr('options');
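A minimal sketch of a typical call, assuming PLS_Toolbox is on the MATLAB path and that x and y already exist in the workspace as the predictor and predicted blocks. Only option fields documented above are changed; the specific values are illustrative choices, not recommendations.

```matlab
% Start from the defaults and change only documented fields
options = gaselctr('options');
options.plots          = 'none';  % suppress plots
options.popsize        = 32;      % must be divisible by 4, between 16 and 256
options.maxgenerations = 50;      % between 25 and 500
options.cv             = 'rnd';   % random-subset cross-validation
options.reps           = 3;       % three replicate runs
options.targetpct      = 0;       % target given in absolute number of variables
options.target         = [5 15];  % illustrative target range
options.penaltyslope   = 0.01;    % illustrative slope; default is 0

% x and y are assumed to be defined by the caller
model = gaselctr(x,y,options);
```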
OUTPUTS
- model = a standard GENALG model structure with the following fields:
- modeltype: 'GENALG' This field will always have this value,
- datasource: {[1x1 struct] [1x1 struct]}, structures defining where the X- and Y-blocks came from,
- date: date stamp for when GASELCTR was run,
- time: time stamp for when GASELCTR was run,
- info: 'Fit results in "rmsecv", population included variables in "icol"', information field describing where the fitness results for each member of the population are contained,
- rmsecv: fitness results for each member of the population. For an MxN X-block with Mp unique population members at convergence, rmsecv is 1xMp,
- icol: each row of icol corresponds to the variables used for that member of the population (a 1 [one] means that variable was used and a 0 [zero] means that it was not). For an MxN X-block with Mp unique population members at convergence, icol is MpxN, and
- detail: [1x1 struct], a structure array containing model details including the following fields:
- avefit: the average fitness at each generation,
- bestfit: the best fitness at each generation, and
- options: a structure corresponding to the options discussed above.
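The rmsecv and icol fields can be combined to pull out the selected variables. The sketch below uses a small mock structure with made-up values in place of a real model (which would come from model = gaselctr(x,y,options)), and assumes that a lower rmsecv indicates a better member. The names bestrmsecv, ibest, selected, and freq are hypothetical.

```matlab
% Mock of the two relevant fields; values are invented for illustration
model.rmsecv = [0.42 0.31 0.37];       % 1xMp fitness values
model.icol   = [1 0 1 0 0 1;           % MpxN inclusion flags
                0 1 1 0 1 0;
                1 1 0 0 0 1];

% Member with the lowest cross-validation error
[bestrmsecv, ibest] = min(model.rmsecv);

% Column indices of the variables that member uses
selected = find(model.icol(ibest,:));

% Fraction of final members that include each variable
freq = mean(double(model.icol), 1);
```

Variables that appear in most of the final population (high freq) are often of more interest than the single best member, because the fit values of the leading members can be very close.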
Examples
To use mean centering outside the genetic algorithm (no additional centering will be performed within the algorithm) do the following:
x2 = mncn(x);
y2 = mncn(y);
[fit,pop] = gaselctr(x2,y2);
To use mean centering inside the genetic algorithm (centering will be performed for each cross-validation subset) do the following:
options = gaselctr('options');
options.preprocessing{1} = preprocess('default', 'mean center');
options.preprocessing{2} = preprocess('default', 'mean center');
[fit,pop] = gaselctr(x,y,options);