From Eigenvector Research Documentation Wiki


Purpose

Genetic algorithm for variable selection with PLS.


Synopsis

model = gaselctr(x,y,options)
[fit,pop,cavfit,cbfit] = gaselctr(x,y,options)


Description

GASELCTR uses a genetic algorithm to search for the subset of variables that minimizes cross-validation error.


Inputs

  • x = the predictor block (x-block), and
  • y = the predicted block (y-block). Note that all scaling should be done prior to running GASELCTR.


Outputs

  • model = a standard GENALG model structure with the following fields:
    • modeltype: 'GENALG'; this field always has this value.
    • datasource: {[1x1 struct] [1x1 struct]}, structures defining where the X- and Y-blocks came from.
    • date: date stamp for when GASELCTR was run.
    • time: time stamp for when GASELCTR was run.
    • info: 'Fit results in "rmsecv", population included variables in "icol"', information field describing where the fitness results for each member of the population are contained.
    • rmsecv: fitness results for each member of the population. If X is MxN and Mp unique members remain in the population at convergence, rmsecv is 1xMp.
    • icol: each row of icol indicates the variables used by one member of the population (a 1 [one] means that variable was used and a 0 [zero] means that it was not). If X is MxN and Mp unique members remain in the population at convergence, icol is MpxN, and
    • detail: [1x1 struct], a structure array containing model details including the following fields:
      • avefit: the average fitness at each generation.
      • bestfit: the best fitness at each generation, and
      • options: a structure corresponding to the options described below.

For the second output syntax shown above,

  • fit is the same as model.rmsecv
  • pop is the same as model.icol
  • cavfit is the same as model.detail.avefit
  • cbfit is the same as model.detail.bestfit
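
As a minimal sketch of how these outputs might be read, the snippet below picks the population member with the lowest fitness and lists the variables it uses. The fit and pop values are made-up stand-ins with the shapes described above (1xMp and MpxN), not results from GASELCTR, and the names minfit, best, selected and usage are illustrative only.

```matlab
% Toy stand-ins for the GASELCTR outputs (made-up values):
% 3 population members, 5 variables.
fit = [0.42 0.31 0.37];          % plays the role of model.rmsecv (1xMp)
pop = [1 0 1 0 0;
       1 0 1 1 0;
       0 1 1 0 1];               % plays the role of model.icol (MpxN)

% The fitness is a cross-validation error, so lower is better.
[minfit, best] = min(fit);

% Column indices of the variables used by the best member.
selected = find(pop(best,:));

% Fraction of population members that include each variable.
usage = mean(pop, 1);
```

Variables with a high usage value appear in most of the surviving population members, which can be a useful complement to looking at the single best member.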


Options

options is a structure array with the following fields:

  • plots: ['none' | {'intermediate'} | 'replicates' | 'final' ] Governs plots.
    • 'final' gives only a final summary plot.
    • 'replicates' gives plots at the end of each replicate.
    • 'intermediate' gives plots during analysis.
    • 'none' gives no plots.
  • display: [ {'on'} | 'off' ] governs output to the command window.
  • popsize: {64} the population size (16<popsize<256 and popsize must be divisible by 4),
  • maxgenerations: {100} the maximum number of generations (25<maxgenerations<500),
  • mutationrate: {0.005} the mutation rate (typically 0.001<mutationrate<0.01),
  • windowwidth: {1} the number of variables in a window (integer window width),
  • convergence: {50} percent of the population that must be the same at convergence (typically convergence = 80),
  • initialterms: {30} percent of terms included at initiation (10<initialterms<50),
  • crossover: {2} breeding cross-over rule (1: single cross-over; 2: double cross-over),
  • algorithm: [ 'mlr' | {'pls'} ] regression algorithm,
  • ncomp: {10} maximum number of latent variables for PLS models,
  • cv: [ 'rnd' | {'con'} ] cross-validation option ('rnd': random subset cross-validation; 'con': contiguous block subset cross-validation),
  • split: {5} number of subsets to divide data into for cross-validation,
  • iter: {1} number of iterations for cross-validation at each generation,
  • preprocessing: {[ ] [ ]} a cell containing standard preprocessing structures for the X- and Y-blocks respectively (see PREPROCESS),
  • preapply: [ {0} | 1 ] If 1, preprocessing is applied to the data once, before the GA runs. This speeds up the selection but may reduce the accuracy of the cross-validation results, so the output "fit" values should only be compared to each other. A full cross-validation should be run after the analysis to get more accurate RMSECV values.
  • reps: {1} the number of replicate runs to perform,
  • target: a two-element vector [target_min target_max] describing the target range for the number of variables/terms, n, included in a model. Field target is used to bias models towards a given range of included variables. Outside of this range, the penaltyslope option (see below) is applied by multiplying the fitness for each member of the population by:
      penaltyslope*(target_min-n) when n<target_min, or
      penaltyslope*(n-target_max) when n>target_max,
  • targetpct: {1} flag indicating if values in field target are given in percent of variables (1) or in absolute number of variables (0), and
  • penaltyslope: {0} the slope of the penalty function (see target above).
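
The target, targetpct and penaltyslope fields can be illustrated with a small sketch that evaluates the penalty expression given above for a few candidate models. All numbers here are hypothetical. The conversion from percent to a variable count (no rounding) is an assumption, and the snippet only computes the documented penaltyslope term; it does not reproduce how GASELCTR combines that term with the fitness internally.

```matlab
nvars        = 200;       % hypothetical number of variables in X
target       = [10 25];   % target range [target_min target_max]
targetpct    = 1;         % 1: target is given in percent of variables
penaltyslope = 0.01;      % hypothetical nonzero slope (the default is 0)
n            = [12 30 60];% variables included by three hypothetical models

% Convert the target range to absolute counts if it is given in percent.
% ASSUMPTION: plain proportional conversion without rounding.
if targetpct
    limits = target/100 * nvars;
else
    limits = target;
end

% Distance outside the target range (zero when n is inside the range).
below = max(limits(1) - n, 0);   % target_min - n when n < target_min
above = max(n - limits(2), 0);   % n - target_max when n > target_max

% Documented penalty expression for each model.
term = penaltyslope * (below + above);
```

With these made-up settings the range is 20 to 50 variables, so only the first model (too few variables) and the third (too many) receive a nonzero term.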


Examples

To use mean centering outside the genetic algorithm (no additional centering will be performed within the algorithm) do the following:

  x2 = mncn(x);
  y2 = mncn(y);
  [fit,pop] = gaselctr(x2,y2);

To use mean centering inside the genetic algorithm (centering will be performed for each cross-validation subset) do the following:

  options = gaselctr('options');
  options.preprocessing{1} = preprocess('default', 'mean center');
  options.preprocessing{2} = preprocess('default', 'mean center');
  [fit,pop] = gaselctr(x,y,options);
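
The difference between the two approaches can be sketched without the toolbox. In the snippet below, centering "inside" uses the mean of the training rows of one cross-validation subset only, while centering "outside" uses the mean of all rows, including the held-out ones. The data, the number of splits and the way contiguous blocks are assigned are illustrative assumptions, not the toolbox's internal procedure.

```matlab
x     = [1 2; 2 4; 3 6; 4 8; 5 10; 6 12];  % toy x-block, 6 samples
split = 3;                                  % number of contiguous subsets
m     = size(x, 1);

% ASSUMPTION: simple contiguous block labels 1..split (here 1 1 2 2 3 3).
block = ceil((1:m)' * split / m);

k       = 1;                    % subset currently held out
train   = x(block ~= k, :);
heldout = x(block == k, :);

% Centering inside cross-validation: mean from the training rows only.
mu_train       = mean(train, 1);
heldout_inside = bsxfun(@minus, heldout, mu_train);

% Centering outside: mean from all rows, held-out rows included.
mu_all          = mean(x, 1);
heldout_outside = bsxfun(@minus, heldout, mu_all);
```

Because mu_all is computed with the held-out rows, the outside-centered values carry some information about the test subset, which is why the resulting cross-validation error can look slightly optimistic.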

See Also

calibsel, fullsearch, genalg, genalgplot, ipls, Genetic Algorithms for Variable Selection