Faq multiple class sets together in SIMCA PLSDA LDA

From Eigenvector Research Documentation Wiki
Revision as of 14:43, 13 November 2018 by imported>Lyle

Issue:

Can I use multiple class sets (categorical variables) together in a SIMCA, PLSDA, or LDA model?

Possible Solutions:

In general, you can only operate on one categorical variable (or class set) at a time. In fact, our Analysis GUI requires you to use only one class set at a time.

You could build individual SIMCA models (each a PCA model built on a single class) for each of the levels (i.e., members) of each of the categorical variables. But SIMCA models are generally independent of one another (adding another member of class A does not change the model for class B), so using multiple categorical variables has no effect there.

For PLSDA and the MLR equivalent (which is Linear Discriminant Analysis, LDA), there are a couple of different scenarios depending on how many "levels" your categorical variables have and whether or not the categorical variables are "complementary" (e.g., just inverses of each other).

The key to understanding what is useful and what isn't is how categorical variables are encoded as a y-block for classification in PLSDA. For a single categorical variable, each class is encoded into a separate "true" or "false" column in the y-block:

A     1  0
A     1  0
A     1  0
B --> 0  1
B     0  1
B     0  1

(Column 1 is "Is A", column 2 is "Is B".)

Thus, the following two-level categorical variables, if combined, would be redundant and provide a trivial solution:

 A  C     1  0  1  0
 A  C     1  0  1  0
 A  C     1  0  1  0
 B  D --> 0  1  0  1
 B  D     0  1  0  1
 B  D     0  1  0  1

Using only one of these categories would give you the same answer as using both.
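The encoding and the redundancy can be checked with a minimal sketch in plain Python. This is not PLS_Toolbox code: `one_hot` is a hypothetical helper that only mimics the true/false encoding described above, and the sorted column order is an assumption of the sketch.

```python
def one_hot(labels):
    """Encode class labels as rows of 0/1 indicator columns.

    Columns follow the sorted order of the distinct labels
    (an assumption of this sketch, not a PLS_Toolbox rule).
    """
    levels = sorted(set(labels))
    return [[1 if label == level else 0 for level in levels] for label in labels]


# Two class sets for the same six samples; cat_CD exactly mirrors cat_AB.
cat_AB = ["A", "A", "A", "B", "B", "B"]
cat_CD = ["C", "C", "C", "D", "D", "D"]

y_AB = one_hot(cat_AB)
y_CD = one_hot(cat_CD)

# Side-by-side combination, one row per sample.
y = [row_ab + row_cd for row_ab, row_cd in zip(y_AB, y_CD)]

# The second pair of columns repeats the first pair, so it adds no information.
is_redundant = y_AB == y_CD
print(y)
print("redundant:", is_redundant)
```

Because the two encoded blocks are identical, the combined y-block contains each column twice, which is why the second class set contributes nothing new.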

If you have two non-trivially different categorical variables (still two levels each), you can encode these similarly, creating a four-column y-block:

 A  C     1  0  1  0
 A  D     1  0  0  1
 A  C     1  0  1  0
 B  D --> 0  1  0  1
 B  C     0  1  1  0
 B  D     0  1  0  1

If you wanted to create this y-block in PLS_Toolbox, you would use the command-line function class2logical with each of the separate categorical variables and combine the results:

 % cat_AB and cat_CD hold the class assignments for the two categorical variables
 y = [class2logical(cat_AB)  class2logical(cat_CD)]

(NOTE: PLSDA in the Analysis GUI automatically handles converting classes into logical y-blocks. It is better to use this automatic management than a hand-constructed y-block.)

However, notice that the first two y-columns are largely unrelated to the second two y-columns: knowing whether a sample is A or B tells you little about whether it is C or D. This is a strong indication that using these two categorical variables together in a PLSDA or LDA model will be DETRIMENTAL. In general, you will almost ALWAYS get a better model using one categorical variable at a time. This is because the two variables are often working "at odds" with each other (the information needed to separate A's from B's is quite different from that needed to separate C's from D's), and combining them forces the model to do more than one thing at a time.
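One rough way to see how much two class sets overlap is the correlation between their indicator columns. The sketch below is plain Python rather than PLS_Toolbox code, and `indicator` and `correlation` are hypothetical helpers. It compares the "Is A" column with the "Is C" column for the redundant example and for the non-trivial example above.

```python
from statistics import mean


def indicator(labels, level):
    """Return a 0/1 column that is 1 where the label equals `level`."""
    return [1 if label == level else 0 for label in labels]


def correlation(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


cat_AB = ["A", "A", "A", "B", "B", "B"]
cat_CD_redundant = ["C", "C", "C", "D", "D", "D"]  # mirrors cat_AB
cat_CD_mixed = ["C", "D", "C", "D", "C", "D"]      # the non-trivial example

is_A = indicator(cat_AB, "A")
r_redundant = correlation(is_A, indicator(cat_CD_redundant, "C"))
r_mixed = correlation(is_A, indicator(cat_CD_mixed, "C"))

print("redundant class sets:", r_redundant)
print("non-trivial class sets:", r_mixed)
```

For these six samples the redundant pair has a correlation of 1, while the non-trivial pair has a correlation of only 1/3. The second class set is therefore mostly unrelated to the first, so including it asks the model for a second, different separation.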

Finally, if any of your categorical variables has more than two levels, these get encoded as an n-column y-block (e.g. 3 levels yields 3 columns) and, although the same "trivial" vs. "non-trivial" rules for multiple categorical variables still hold, it is far less likely you will get any advantage in combining multiple categories.

 A      1  0  0
 A      1  0  0
 B      0  1  0
 B  --> 0  1  0
 C      0  0  1
 C      0  0  1

You can imagine that combining this three-column y-block with another two- or three-column y-block will only give the model more complexity to handle.
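The growth in y-block width can be illustrated with a short sketch in plain Python. This is not PLS_Toolbox code: `one_hot` is a hypothetical helper, and the second class set `cat_DE` is made-up data used only to show the column count.

```python
def one_hot(labels):
    """Encode class labels as rows of 0/1 columns, one column per distinct label."""
    levels = sorted(set(labels))
    return [[1 if label == level else 0 for level in levels] for label in labels]


cat_ABC = ["A", "A", "B", "B", "C", "C"]  # three levels, as in the table above
cat_DE = ["D", "E", "D", "E", "D", "E"]   # hypothetical two-level class set

y_ABC = one_hot(cat_ABC)
y_DE = one_hot(cat_DE)

# Placing the blocks side by side adds their column counts.
y = [row_abc + row_de for row_abc, row_de in zip(y_ABC, y_DE)]
n_columns = len(y[0])

print(y_ABC)
print("columns in combined y-block:", n_columns)
```

Each extra level adds another y-column the model has to predict, so a three-level class set combined with a two-level one already requires five columns.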