Mdcheck

From Eigenvector Research Documentation Wiki
Revision as of 11:26, 10 August 2018 by imported>Neal

Purpose

Missing Data Checker and infiller.

Synopsis

[flag,missmap,infilled] = mdcheck(data,options)

Description

This function checks for missing data and, if desired, infills it using a PCA model. The input data is the data to be checked, supplied as either a double array or a dataset object. The optional input options is a structure that controls how the function runs (see below).

Outputs are the fraction of missing data flag, a map of the locations of the missing data as a uint8 variable missmap, and the data with the missing values filled in infilled. Depending on the plots option, a plot of the missing data map may also be produced.
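
As a minimal sketch of a call with default options, the example below passes a small hypothetical double array to mdcheck and displays the three outputs. It assumes that the missing value is marked with NaN, as in the example in the Algorithm section; the data values are illustrative only.

```matlab
% Hypothetical data: 5 samples x 2 variables, one value missing (marked NaN)
data = [1 2; 2 4; 3 6; 4 8; NaN 10];

% Call with default options (the options input is optional)
[flag,missmap,infilled] = mdcheck(data);

disp(flag)       % fraction of missing data
disp(missmap)    % uint8 map of the missing-data locations
disp(infilled)   % data with the missing value filled in
```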


Options

  • options = a structure array with the following fields:
  • plots: [{'none'} | 'final'], governs plot of missing data map,
  • display: [{'off'} | 'on'], governs level of display,
  • frac_ssq: [{0.95}] desired fraction between 0 and 1 of variance to be captured by the PCA model,
  • max_pcs: [{5}] maximum number of PCs in the model; if 0, the mean is used instead,
  • meancenter: ['no' | {'yes'}], tells whether to use mean centering in the algorithm,
  • recalcmean: ['no' | {'yes'}], recalculate the mean center after each cycle of replacement (may improve results for small matrices),
  • tolerance: [{1e-6 100}] convergence criteria; the first element is the minimum change and the second is the maximum number of iterations,
  • max_missing: [{0.4}] maximum fraction of missing data with which MDCHECK will operate,
  • toomuch: [{'error'} | 'exclude'] action to take if too much missing data is found. 'error' exits with an error message; 'exclude' excludes elements (rows/columns/slabs/etc.) that contain too much missing data from the data before replacement. 'exclude' requires a dataset object as input for (data), and
  • algorithm: [{'svd'} | 'nipals' | 'knn'] specifies the missing data algorithm to use. NIPALS is typically used for large amounts of missing data or large multi-way arrays. KNN works for sparsely populated data sets.

Note: For algorithm = 'svd' or 'nipals', MDCHECK captures up to options.frac_ssq of the variance using options.max_pcs or fewer PCA components.

The default options can be retrieved using: options = mdcheck('options');
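
The following sketch shows one way the options structure might be customized before calling mdcheck. The field names and allowed values are taken from the list above; the random data matrix and the particular settings chosen are assumptions made for illustration, not recommendations.

```matlab
% Hypothetical data: 50 samples x 8 variables with roughly 5% of values set to NaN
data = randn(50,8);
data(rand(size(data)) < 0.05) = NaN;

% Start from the defaults and change only the fields of interest
options = mdcheck('options');
options.algorithm = 'nipals';     % use NIPALS instead of the default 'svd'
options.max_pcs   = 3;            % allow at most 3 PCs in the model
options.frac_ssq  = 0.99;         % capture up to 99% of the variance
options.tolerance = [1e-8 200];   % [minimum change, maximum number of iterations]
options.plots     = 'final';      % plot the missing data map

[flag,missmap,infilled] = mdcheck(data,options);
```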

Algorithm

The replacement algorithm is a successive approximations routine using PCA models based on SVD or NIPALS to replace the data. Values for the missing data are first estimated using the mean of each variable. Then, a PCA model which captures a given percentage of the variance is calculated and the missing values are replaced again to be most consistent with the loadings of the PCA model (see the replace function). The PCA model is recalculated using the newly replaced data and the process is repeated until the change in the replaced values drops below a threshold.
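
The successive approximations idea can be sketched in a few lines of base MATLAB. This is an illustration under stated assumptions, not the code used by mdcheck or replace: the number of PCs is fixed at one (mdcheck instead chooses it from frac_ssq and max_pcs), the two-variable data matrix is hypothetical, and all variable names are made up for the example.

```matlab
% Hypothetical two-variable data with one missing value (NaN)
X = [1 2; 2 4; 3 6; 4 8; NaN 10];
miss  = isnan(X);          % logical map of the missing entries
ncomp = 1;                 % assumption: fixed number of PCs for this sketch
tol   = 1e-6;              % minimum change (cf. first element of tolerance)
maxit = 100;               % maximum iterations (cf. second element of tolerance)

% Initial estimate: mean of the available values in each variable
Xf = X;
for j = 1:size(X,2)
  Xf(miss(:,j),j) = mean(X(~miss(:,j),j));
end

for it = 1:maxit
  Xold = Xf;
  mx = mean(Xf,1);                        % recalculate the mean each cycle
  Xc = Xf - repmat(mx,size(Xf,1),1);      % mean-center the filled data
  [U,S,V] = svd(Xc,'econ');               % PCA of the filled data via SVD
  P = V(:,1:ncomp);                       % loadings
  for i = find(any(miss,2))'              % rows that contain missing values
    m = miss(i,:);
    t = P(~m,:) \ Xc(i,~m)';              % scores estimated from the known variables
    Xf(i,m) = mx(m) + (P(m,:)*t)';        % values consistent with the loadings
  end
  if max(abs(Xf(miss) - Xold(miss))) < tol, break; end
end

disp(Xf(5,1))   % the replaced value moves from the initial mean (2.5) toward 5
```

Estimating the scores from the known variables only, rather than projecting the whole filled row, is one possible reading of "most consistent with the loadings"; the actual replace function may differ in detail.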

Using PCA to replace data generally works better than using the mean of a variable because it uses the covariance in the data to estimate what the missing values should be. For example, given two variables:

  A    B
  1    2
  2    4
  3    6
  4    8
 NaN  10

Replacement of the NaN in column A with the mean of A would make the value 2.5. However, given that column B is always 2x column A, a value of 5 would be more consistent with the covariance of the two variables. A PCA model gives the correct value but using a simple mean-variable replacement leads to a 50% error.
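
The arithmetic behind these numbers can be checked with base MATLAB. The sketch below uses a simple least-squares slope as a stand-in for the PCA model, which is an assumption made to keep the example short; it is not how mdcheck computes the replacement.

```matlab
A = [1 2 3 4 NaN]';
B = [2 4 6 8 10]';
known = ~isnan(A);

meanfill = mean(A(known));            % mean of the available values of A
slope    = A(known) \ B(known);       % least-squares slope of B on A (no intercept)
covfill  = B(~known) / slope;         % value of A implied by B = slope*A
relerr   = abs(meanfill - covfill) / covfill;

fprintf('mean fill = %.2f, covariance-based fill = %.2f, relative error = %.0f%%\n', ...
        meanfill, covfill, 100*relerr);
% prints: mean fill = 2.50, covariance-based fill = 5.00, relative error = 50%
```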

The KNN algorithm (algorithm = 'knn') initializes by replacing missing data for a variable with the mean of the available (non-missing) data. KNN is then performed on the result, and the missing data are replaced with the nearest neighbor based on the KNN results.
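
A rough sketch of this idea in base MATLAB is given below. It assumes a single nearest neighbor (k = 1), Euclidean distance computed on the mean-filled rows, and a hypothetical data matrix; the choices made inside mdcheck (number of neighbors, distance measure, handling of ties) are not described here and may differ.

```matlab
% Hypothetical data: 5 samples x 2 variables, one value missing (NaN)
X = [1.0 2.1; 2.0 3.9; 3.0 6.2; 4.0 8.1; NaN 7.9];
miss = isnan(X);

% Initialize: replace missing values with the mean of the available data
Xf = X;
for j = 1:size(X,2)
  Xf(miss(:,j),j) = mean(X(~miss(:,j),j));
end

% For each row with missing values, copy them from its nearest neighbor
Xk = Xf;
for i = find(any(miss,2))'
  d = sqrt(sum((Xf - repmat(Xf(i,:),size(Xf,1),1)).^2, 2));  % distances to all rows
  d(i) = Inf;                                % exclude the row itself
  [~,nn] = min(d);                           % index of the nearest neighbor
  Xk(i,miss(i,:)) = Xf(nn,miss(i,:));        % take the neighbor's values
end

disp(Xk(5,1))   % value taken from row 4, the closest row in this example
```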

See Also

excludemissing, parafac, pca, replace