Faq why get missing data warning
Latest revision as of 13:05, 8 January 2019
Issue:
Why do I get the warning/notice "Missing Data Found - Replacing with "best guess" from existing model. Results may be affected by this action."
Possible Solutions:
The warning appears because your data contains NaN (Not a Number) values somewhere. NaN marks "missing data": data points for which you do not have values. Certain preprocessing steps can sometimes introduce NaNs, but the most likely cause is that the data you imported already had some missing data points.
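To see where the missing values sit before building a model, a quick check like the one below can help. It is a minimal sketch in Python with NumPy, not PLS_Toolbox code; the function name `summarize_missing` and the small example matrix are made up for illustration, with rows treated as samples and columns as variables.

```python
import numpy as np

def summarize_missing(X):
    """Return the NaN mask plus the indices of affected samples and variables."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)                        # True wherever a value is missing
    rows = np.flatnonzero(mask.any(axis=1))   # samples with at least one NaN
    cols = np.flatnonzero(mask.any(axis=0))   # variables with at least one NaN
    return mask, rows, cols

# Hypothetical data: 3 samples x 3 variables, one value missing.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

mask, rows, cols = summarize_missing(X)
print("samples with missing values:", rows.tolist())
print("variables with missing values:", cols.tolist())
print("total missing values:", int(mask.sum()))
```

If only a few samples or variables show up in these lists, excluding them may be simpler than relying on imputation.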
The warning matters because the algorithm requires a value for every variable in every sample in order to build a model. To handle this problem, PLS_Toolbox uses a data imputation algorithm: it estimates a value for each missing data point, builds a PCA model of all the data, and then uses that model to replace the missing data points again. This is repeated until the replaced values converge on unchanging values. The procedure is not perfect and can still lead to samples which have high leverage or residuals (i.e., samples that are outliers), but if you have lots of missing data, it may be the only reasonable approach.
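The iterative procedure described above can be sketched roughly as follows. This is an illustration in Python with NumPy, not the PLS_Toolbox implementation. The function name `pca_impute`, the column-mean starting guess, the number of components, and the convergence tolerance are all assumptions made for the example.

```python
import numpy as np

def pca_impute(X, n_components=1, tol=1e-8, max_iter=500):
    """Fill NaNs by repeatedly fitting a PCA model and re-estimating them."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 1 (assumed starting guess): column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    for _ in range(max_iter):
        # Step 2: build a PCA model of all the (currently filled) data.
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        k = n_components
        recon = (U[:, :k] * s[:k]) @ Vt[:k] + mean

        # Step 3: replace only the missing entries with the model estimate.
        updated = np.where(missing, recon, filled)
        change = np.max(np.abs(updated - filled))
        filled = updated

        # Step 4: stop once the replaced values no longer change.
        if change < tol:
            break
    return filled

# Hypothetical data in which every row is a multiple of (1, 2, 3).
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0],
              [5.0, np.nan, 15.0]])

X_filled = pca_impute(X, n_components=1)
print(X_filled[4, 1])
```

In this made-up example the data are perfectly collinear, so the estimate should land very close to 10, the value consistent with the other rows. Real data are noisier, and the filled-in values depend on the number of components chosen.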
If data is missing in only a couple of samples, you could exclude those samples and build a model from the remaining data. (You can also later use the PLS_Toolbox replace function to estimate the missing values for the excluded samples using that model and then rebuild the model with all data. This may give a better estimate than the PCA imputation method gives.)
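The exclude-then-estimate workflow might look something like the sketch below. It is written in Python with NumPy and does not call or reproduce the PLS_Toolbox replace function. The helpers `fit_pca` and `fill_from_model` are hypothetical, and estimating scores by least squares on the observed variables is one common approach assumed here for illustration.

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit a PCA model; return the variable means and the loadings."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T          # loadings: variables x components

def fill_from_model(x, mean, loadings):
    """Estimate the missing entries of one sample from an existing PCA model."""
    x = np.asarray(x, dtype=float)
    obs = ~np.isnan(x)
    # Scores are fitted by least squares using the observed variables only.
    scores, *_ = np.linalg.lstsq(loadings[obs], x[obs] - mean[obs], rcond=None)
    estimate = mean + loadings @ scores
    return np.where(obs, x, estimate)         # keep observed values untouched

# Hypothetical data: only the last sample has a missing value.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0],
              [5.0, np.nan, 15.0]])

complete = ~np.isnan(X).any(axis=1)           # samples with no missing data
mean, loadings = fit_pca(X[complete], n_components=1)

X_filled = X.copy()
for i in np.flatnonzero(~complete):
    X_filled[i] = fill_from_model(X[i], mean, loadings)

# The model can now be rebuilt using all samples.
mean_all, loadings_all = fit_pca(X_filled, n_components=1)
print(X_filled[4])
```

Because the model here is built only from complete samples, the initial guesses for the missing values never influence it, which is one reason this route may give better estimates when few samples are affected.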
If data is missing from a lot of samples, you don't have any other real option. There are, however, some algorithms which use weighting to ignore missing values. See, for example, the tucker and tld functions.
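To illustrate the idea of using weights to ignore missing values, here is a simplified two-way sketch in Python with NumPy. It is not how tucker or tld are implemented (those are multiway methods). The function name `weighted_rank1`, the one-component model, and the fixed iteration count are assumptions for the example. Missing entries get a weight of zero, so they never contribute to the fit and no values have to be filled in beforehand.

```python
import numpy as np

def weighted_rank1(X, n_iter=200):
    """Fit X ~ outer(t, p) by alternating least squares, skipping NaN entries."""
    X = np.asarray(X, dtype=float)
    W = (~np.isnan(X)).astype(float)   # weight 1 = observed, 0 = missing
    X0 = np.where(W > 0, X, 0.0)       # zero out NaNs; their weight is 0 anyway

    t = np.ones(X.shape[0])            # starting scores (assumed)
    for _ in range(n_iter):
        # Loadings: weighted least squares over the observed rows of each column.
        p = (X0.T @ t) / (W.T @ t**2)
        # Scores: weighted least squares over the observed columns of each row.
        t = (X0 @ p) / (W @ p**2)
    return t, p

# Hypothetical data in which every row is a multiple of (1, 2, 3).
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0],
              [5.0, np.nan, 15.0]])

t, p = weighted_rank1(X)
model = np.outer(t, p)                 # fitted values, including the missing cell
print(model[4, 1])
```

The sketch assumes every sample and every variable has at least one observed value; otherwise the divisions above would be by zero.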
Still having problems? Please contact our helpdesk at helpdesk@eigenvector.com