How to do factor analysis

January 27, 2015

This is a guest post by Evan Warfel. The code and results are available on Domino.

Background

P-values. T-tests. Categorical variables. All are contenders for the most misused statistical technique or data scientific tool. Yet factor analysis is a whole different ball game. Though far from over-used, it is unquestionably the most controversial statistical technique, due to its role in debates about general intelligence. You didn’t think statistical techniques could be divisive, did you? [Not counting that one time that the U.C. Berkeley stats department left Andrew Gelman.]

In data science texts, factor analysis is that technique that is always mentioned along with PCA, and then subsequently ignored. It is like the Many Worlds interpretation of quantum mechanics, nay, it is the Star Wars Holiday Special of data science. Everyone is vaguely familiar with it, but no one seems to really understand it.

Factor analysis aims to give insight into the latent variables that are behind people’s behavior and the choices that they make. PCA, on the other hand, is all about the most compact representation of a dataset by picking dimensions that capture the most variance. This distinction can be subtle, but one notable difference is that PCA assumes no error of measurement or noise in the data: all of the ‘noise’ is folded into the variance capturing. Another important difference is that the number of researcher degrees of freedom, or choices one has to make, is much greater than that of PCA. Not only does one have to choose the number of factors to extract (there are ~10 theoretical criteria which rarely converge), but then decide on the method of extraction (there are ~7), as well as the type of rotation (there are also 7), as well as whether to use a variance or covariance matrix, and so on.

The goal of factor analysis is to figure out whether many individual behaviors of users can be explained by a smaller number of latent characteristics. Making this more concrete, imagine you operate a restaurant. Although some of your customers eat pretty healthily, you notice that many often order a side of Poutine with their otherwise dietetic kale salad. Being an inquisitive and data-oriented restaurateur, you come up with a hypothesis – every order can be explained with one 'healthfulness' dimension, and people who order Poutine and kale salad at the same time are somewhere in the middle of a dimension characterized by 'Exclusive Kale and Squash eaters' on one end and 'Eats nothing but bacon' on the other.

However, you note that this might not explain differences in how customers actually place their orders, so you come up with another hypothesis – perhaps there are two dimensions: one about how much people love or hate kale, the other about how much they love or hate Poutine. Maybe these dimensions, you reason, are orthogonal. Maybe they are somewhat negatively correlated. Maybe there is also a dimension concerned with how much your customers like the other competing dishes on the menu.

You start to notice a trend in your own hypotheses, and realize that there could be any number of theorized dimensions. What you really want to know is the smallest number of dimensions that explain the most amount of variance in how your customers place their orders. Presumably so you can keep more Poutine to yourself. (‘Cause that stuff is delicious).

More technically, running a factor analysis is the mathematical equivalent of asking a statistically savvy oracle the following: “Suppose there are N latent variables that are influencing people’s choices – tell me how much each variable influences the responses for each item that I see, assuming that there is measurement error on everything.” Oftentimes, the ‘behavior’ or responses being analyzed come in the form of how people answer questions on surveys.

Mathematically speaking, for person i, item j, and behavior Yij, factor analysis seeks to determine the following:

Yij = Wj1 * Fi1 + Wj2 * Fi2 + … + Uij

Where W’s are the factor weights or loadings, F’s are the factors, and U is the measurement error / the variance that can’t be accounted for by the other terms in the equation. The insight of the people who created factor analysis was that this equation is actually a matrix reduction problem.
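To make the equation concrete, here is a minimal simulation sketch (my own illustration, not part of the analysis below): each observed response Yij is built as a weighted sum of two latent factors plus item-level noise, exactly as the formula above describes. The loadings in W and the noise level are arbitrary.

set.seed(1)
n  <- 500                                      # number of people
Fs <- matrix(rnorm(n * 2), ncol = 2)           # latent factors F_i1, F_i2 (named Fs to avoid clobbering R's F)
W  <- matrix(c(0.8, 0.1,                       # made-up factor weights W_j1, W_j2 for four items
               0.7, 0.2,
               0.1, 0.9,
               0.2, 0.8), ncol = 2, byrow = TRUE)
U  <- matrix(rnorm(n * 4, sd = 0.5), ncol = 4) # measurement error U_ij
Y  <- Fs %*% t(W) + U                          # observed behavior Y_ij, one column per item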

Just like any technique, factor analysis won’t run blind – you have to determine the number of factors to extract, much like picking the number of dimensions to reduce to with PCA. There are numerous indicators of which number you should pick; I’ll go over the best ones later. I mention this because when you read guides and papers about factor analysis, the biggest concern is extracting the right number of factors. Make no mistake—this is something to worry about. However, the most important part of factor analysis (or all data analysis, for that matter), alas, is almost never mentioned. The number one thing to be mindful of when doing data or factor analysis is the tendency your brain has to lie to you. Given the striking number of researcher degrees of freedom involved in factor analysis, it is very easy to justify making different choices when the results don’t conform to your intuitions.

Don’t believe me? Try this: Jack, George, and Anne are guests at a dinner party. Jack is looking at Anne, and Anne is looking at George. Jack is married, George is not. Is a married person looking at an unmarried person?

A) Yes
B) No
C) Not enough information.

Most people, upon reading this question, realize there is a trick involved and grab a piece of paper to work out the answer. They then pick C. It seems logical. Yet the correct answer is A – it doesn’t matter whether Anne is married: if she is, then she (a married person) is looking at unmarried George; if she isn’t, then married Jack is looking at unmarried Anne. Reno really is west of Los Angeles. The struggle continues.

Unless you take an intentional stance against it, your brain will try to rationalize its preconceived notions onto your analysis. This usually takes the form of rounding factor loadings up or down, or of justifying how many factors to extract. Remember: you want to believe what the data says you should believe.

Actually doing factor analysis

The code shown below is available on Domino, where you can also see its output.

Preprocessing

Before you do factor analysis, you’ll need a few things.

First: get R if you don’t already have it. Then get the ‘psych’ package, which is unparalleled as free factor analysis software. Load it by typing library(psych).
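In an R session, that looks like:

install.packages("psych")   # only needed the first time
library(psych)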

Next: Get Data. On my end, I’ll be using the ‘bfi’ dataset that comes with the psych package. The data is in the form of responses to personality questions known as the Big Five Inventory. However, any sort of record of behavior will do – at the end of the day, you’ll need to be able to make a full correlation matrix. The larger your sample size, the better. A sample size of 400-500 is generally agreed to be a good rule of thumb.
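With the package loaded, the bfi data is available right away; a quick sanity check of its shape looks like this (the counts match the describe output below):

data(bfi)   # ships with the psych package
dim(bfi)    # 2800 respondents x 28 columns (25 items plus gender, education, age)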

Now the fun begins. The psych package has a handy describe function. (Note: each alphanumeric pair names a personality item, answered on a 6-point Likert scale.)

> describe(bfi)

          vars    n  mean    sd median trimmed   mad min max range  skew kurtosis   se
A1           1 2784  2.41  1.41      2    2.23  1.48   1   6     5  0.83    -0.31 0.03
A2           2 2773  4.80  1.17      5    4.98  1.48   1   6     5 -1.12     1.05 0.02
A3           3 2774  4.60  1.30      5    4.79  1.48   1   6     5 -1.00     0.44 0.02
A4           4 2781  4.70  1.48      5    4.93  1.48   1   6     5 -1.03     0.04 0.03
A5           5 2784  4.56  1.26      5    4.71  1.48   1   6     5 -0.85     0.16 0.02
C1           6 2779  4.50  1.24      5    4.64  1.48   1   6     5 -0.85     0.30 0.02
C2           7 2776  4.37  1.32      5    4.50  1.48   1   6     5 -0.74    -0.14 0.03
C3           8 2780  4.30  1.29      5    4.42  1.48   1   6     5 -0.69    -0.13 0.02
C4           9 2774  2.55  1.38      2    2.41  1.48   1   6     5  0.60    -0.62 0.03
C5          10 2784  3.30  1.63      3    3.25  1.48   1   6     5  0.07    -1.22 0.03
E1          11 2777  2.97  1.63      3    2.86  1.48   1   6     5  0.37    -1.09 0.03
E2          12 2784  3.14  1.61      3    3.06  1.48   1   6     5  0.22    -1.15 0.03
E3          13 2775  4.00  1.35      4    4.07  1.48   1   6     5 -0.47    -0.47 0.03
E4          14 2791  4.42  1.46      5    4.59  1.48   1   6     5 -0.82    -0.30 0.03
E5          15 2779  4.42  1.33      5    4.56  1.48   1   6     5 -0.78    -0.09 0.03
N1          16 2778  2.93  1.57      3    2.82  1.48   1   6     5  0.37    -1.01 0.03
N2          17 2779  3.51  1.53      4    3.51  1.48   1   6     5 -0.08    -1.05 0.03
N3          18 2789  3.22  1.60      3    3.16  1.48   1   6     5  0.15    -1.18 0.03
N4          19 2764  3.19  1.57      3    3.12  1.48   1   6     5  0.20    -1.09 0.03
N5          20 2771  2.97  1.62      3    2.85  1.48   1   6     5  0.37    -1.06 0.03
O1          21 2778  4.82  1.13      5    4.96  1.48   1   6     5 -0.90     0.43 0.02
O2          22 2800  2.71  1.57      2    2.56  1.48   1   6     5  0.59    -0.81 0.03
O3          23 2772  4.44  1.22      5    4.56  1.48   1   6     5 -0.77     0.30 0.02
O4          24 2786  4.89  1.22      5    5.10  1.48   1   6     5 -1.22     1.08 0.02
O5          25 2780  2.49  1.33      2    2.34  1.48   1   6     5  0.74    -0.24 0.03
gender      26 2800  1.67  0.47      2    1.71  0.00   1   2     1 -0.73    -1.47 0.01
education   27 2577  3.19  1.11      3    3.22  1.48   1   5     4 -0.05    -0.32 0.02
age         28 2800 28.78 11.13     26   27.43 10.38   3  86    83  1.02     0.56 0.21

There is some demographic data included in this dataset, which I will trim for the factor analysis.

df <- bfi[1:25]

While factor analysis works on both covariance and correlation matrices, the recommended practice is to use a correlation matrix. That’s right—all you really need is a correlation matrix of different indicators of behavior (even if that behavior is ‘clicking on a button’, ‘answering a question a certain way’, or ‘actually giving us money’).
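If you would rather start from the correlation matrix itself (say, because that is all you can store or share), a sketch like the following works; note that psych’s fa() accepts either raw data or a correlation matrix, and in the latter case you should also pass n.obs. Pairwise deletion is one reasonable way to handle the missing responses in bfi.

R <- cor(df, use = "pairwise.complete.obs")   # item-by-item correlation matrix
# later: fa(R, nfactors = 6, n.obs = nrow(df), fm = "minres", rotate = "oblimin")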

Determining the number of factors

Though there are myriad indicators for the ‘proper’ number of factors to extract, there are two main techniques, other than the time-honored tradition of inspecting various factor solutions and interpreting the results. The first is to inspect a scree plot, or an ‘eigenvalue vs. number of factors / components’ chart. (A screed plot, on the other hand, usually involves lots of cement.)

> scree(df)

After a certain point, each additional factor or component results in only a marginal reduction in eigenvalue. (Translation: each additional factor doesn’t explain much more variance.) There will generally be some sort of ‘elbow’ in the plot, and the idea is to pick the last factor before the curve flattens out – the last one that still meaningfully reduces the unexplained variance. Is this subjective? Yes. Can intuition be built around this rule? Yes.

A word of caution: there is a tendency to just keep every factor whose eigenvalue is greater than one. This is a near-universal mistake. Don’t do it. You’ve been warned. Also, it helps to make sure that you are viewing the scree plot full-sized, and not just in the small RStudio plot window.

In this case, if we strictly follow the ‘find the elbow’ rule, it looks like ‘6’ is the highest number one could get away with. There are more sophisticated methods for double checking the number of factors to extract, like parallel analysis. A description of parallel analysis, courtesy of The Journal of Vegetation Science: “In this procedure, eigenvalues from a data set prior to rotation are compared with those from a matrix of random values of the same dimensionality (p variables and n samples).” The idea is that any eigenvalues below those generated by random chance are superfluous.

> fa.parallel(bfi)

Parallel analysis suggests that the number of factors = 6 and the number of components = 6

Here is the plot output:

By now you are thinking, “Okay, we’ve decided on the number of factors to extract. Can we just get it over with already? My buddy is doing PCA and she’s already left to go eat her Kale and Poutine lunch.” Not so fast. We have to figure out how to extract the 6 factors, and then whether and how we want to rotate them to aid our interpretation.

Factor Extraction

There are a plethora of factor extraction techniques, the merits of most of which are compared in this (useful) thrill-a-minute thesis.

Here is what you need to know. There are three main factor extraction techniques: Ordinary Least Squares (also called ‘Minimum Residuals’, or ‘minres’ for short), Maximum Likelihood, and Principal Axis factoring. OLS / minres has been found to outperform other methods in a variety of situations, and usually gives a solution close to what you would get if you used Maximum Likelihood. Maximum Likelihood is useful because it lets you calculate confidence intervals. Principal Axis factoring is a widely used method that places most of the variance on the first factor. As with all data analysis, if you have robust, meaningful results or signal in your data from a new method or experiment, then what you’ll be concerned with should be invariant to the factor extraction technique. But if your work is sensitive to smaller differences in factor loadings and their interpretation, then it is worth taking the time to figure out which tool is best for you. For an exploratory analysis of the bfi data, the OLS / minres method suffices.
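If you want to check that sensitivity directly, one quick comparison (my suggestion, using psych’s factor.congruence()) is to extract the same number of factors with two different fm settings and see how similar the unrotated solutions are:

f_minres <- fa(df, 6, fm = 'minres', rotate = 'none')
f_ml     <- fa(df, 6, fm = 'ml',     rotate = 'none')
round(factor.congruence(f_minres, f_ml), 2)   # congruence coefficients near 1 mean the factors agree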

Rotation

Extracting factors is one thing, but raw factors are usually difficult to interpret, which arguably defeats the whole point of this exercise. To adjust for this, it is common to ‘rotate’ the solution, i.e., choose slightly different axes in the n-factor subspace so that your results are more interpretable. Essentially, rotation trades the neat concentration of explained variance on the leading factors for actually knowing what is going on. (This is a little hand-wavy, but rotation is strongly recommended by most, if not all, notable 20th-century psychometricians.)

Quite unlike Kale, rotation comes in two distinct flavors. An orthogonal rotation assumes that the factors are uncorrelated, while an oblique rotation allows them to be correlated. The choice between orthogonal and oblique depends on your particular use case. If your data consists of items from one large domain and you have no reason to think that certain behaviors would be completely uncorrelated, use an oblique rotation. If you want to know more, see here for a brief overview, and here for a little more depth.
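One pragmatic check if you are on the fence: fit the model with an oblique rotation (the exact command is covered just below) and inspect the factor correlation matrix, stored as Phi on the psych fa object. If those correlations all come out near zero, an orthogonal rotation would have told much the same story.

fit_obl <- fa(df, 6, fm = 'minres', rotate = 'oblimin')
round(fit_obl$Phi, 2)   # factor inter-correlations; near-zero values would favor an orthogonal rotation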

Two popular types of rotation are Varimax (orthogonal) and Oblimin (oblique). Given that the data I am analyzing is based on personality items, I’ll choose oblimin rotation, as there is good a priori reason to assume that the factors of personality are not orthogonal.

Factor analysis has a really simple command in R:

> fa(df,6,fm='minres',rotate='oblimin')

Factor Analysis using method =  minres
Call: fa(r = df, nfactors = 6, rotate = "oblimin", fm = "minres")
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR1   MR3   MR5   MR4   MR6   h2   u2 com
A1  0.10 -0.11  0.08 -0.56  0.05  0.28 0.33 0.67 1.7
A2  0.04 -0.03  0.07  0.69  0.00 -0.06 0.50 0.50 1.0
A3 -0.01 -0.12  0.03  0.62  0.06  0.10 0.51 0.49 1.2
A4 -0.07 -0.06  0.20  0.39 -0.11  0.15 0.28 0.72 2.2
A5 -0.16 -0.21  0.01  0.45  0.12  0.21 0.48 0.52 2.3
C1  0.01  0.05  0.55 -0.06  0.18  0.07 0.35 0.65 1.3
C2  0.06  0.13  0.68  0.01  0.11  0.17 0.50 0.50 1.3
C3  0.01  0.06  0.55  0.09 -0.05  0.04 0.31 0.69 1.1
C4  0.05  0.08 -0.63 -0.07  0.06  0.30 0.55 0.45 1.5
C5  0.14  0.19 -0.54 -0.01  0.11  0.07 0.43 0.57 1.5
E1 -0.13  0.59  0.11 -0.12 -0.09  0.08 0.38 0.62 1.3
E2  0.05  0.69 -0.01 -0.07 -0.06  0.03 0.55 0.45 1.1
E3  0.00 -0.35  0.01  0.15  0.39  0.21 0.48 0.52 2.9
E4 -0.05 -0.55  0.03  0.19  0.03  0.29 0.56 0.44 1.8
E5  0.17 -0.41  0.26  0.07  0.22 -0.02 0.40 0.60 2.9
N1  0.85 -0.09  0.00 -0.06 -0.05  0.00 0.70 0.30 1.0
N2  0.85 -0.04  0.01 -0.02 -0.01 -0.08 0.69 0.31 1.0
N3  0.64  0.15 -0.04  0.07  0.06  0.11 0.52 0.48 1.2
N4  0.39  0.44 -0.13  0.07  0.11  0.09 0.48 0.52 2.5
N5  0.40  0.25  0.00  0.16 -0.09  0.20 0.35 0.65 2.8
O1 -0.05 -0.05  0.08 -0.04  0.56  0.03 0.34 0.66 1.1
O2  0.11 -0.01 -0.07  0.08 -0.37  0.35 0.29 0.71 2.4
O3 -0.02 -0.10  0.02  0.03  0.66  0.00 0.48 0.52 1.1
O4  0.08  0.35 -0.02  0.15  0.38 -0.02 0.25 0.75 2.4
O5  0.03 -0.06 -0.02 -0.05 -0.45  0.40 0.37 0.63 2.1

                       MR2  MR1  MR3  MR5  MR4  MR6
SS loadings           2.42 2.22 2.04 1.88 1.67 0.83
Proportion Var        0.10 0.09 0.08 0.08 0.07 0.03
Cumulative Var        0.10 0.19 0.27 0.34 0.41 0.44
Proportion Explained  0.22 0.20 0.18 0.17 0.15 0.07
Cumulative Proportion 0.22 0.42 0.60 0.77 0.93 1.00

 With factor correlations of 
      MR2   MR1   MR3   MR5   MR4   MR6
MR2  1.00  0.25 -0.18 -0.10  0.02  0.18
MR1  0.25  1.00 -0.22 -0.31 -0.19 -0.06
MR3 -0.18 -0.22  1.00  0.20  0.19 -0.03
MR5 -0.10 -0.31  0.20  1.00  0.25  0.15
MR4  0.02 -0.19  0.19  0.25  1.00  0.02
MR6  0.18 -0.06 -0.03  0.15  0.02  1.00

Mean item complexity =  1.7
Test of the hypothesis that 6 factors are sufficient.

The degrees of freedom for the null model are  300  and the objective function was  7.23 with Chi Square of  20163.79
The degrees of freedom for the model are 165  and the objective function was  0.36 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

The harmonic number of observations is  2762 with the empirical chi square  660.84  with prob <  1.6e-60 
The total number of observations was  2800  with MLE Chi Square =  1013.9  with prob <  4.4e-122 

Tucker Lewis Index of factoring reliability =  0.922
RMSEA index =  0.043  and the 90 % confidence intervals are  0.04 0.045
BIC =  -295.76
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                MR2  MR1  MR3  MR5  MR4  MR6
Correlation of scores with factors             0.93 0.89 0.88 0.87 0.85 0.77
Multiple R square of scores with factors       0.87 0.80 0.78 0.77 0.73 0.59
Minimum correlation of possible factor scores  0.73 0.59 0.56 0.53 0.46 0.18

There is a lot in this output, and I won’t unpack it all – you can find more detail in the documentation of the psych package. Included in the printout are metrics for how well the model fits the data. The standard rule of thumb is that the RMSEA index should be less than .06. The other metrics can be valuable, but each has a specific case or three for which it doesn’t work; RMSEA works across the board. Double-check to make sure this value isn’t too high.
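If you would rather pull these fit numbers out programmatically than read them off the printout, they are stored on the fitted object; the field names below are what the psych fa object uses (worth double-checking against the package documentation for your version):

fit <- fa(df, 6, fm = 'minres', rotate = 'oblimin')
fit$RMSEA   # RMSEA estimate with its confidence bounds
fit$TLI     # Tucker-Lewis Index
fit$BIC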

Then, the fun part – do a rough inspection of the factors by calling

> print(fa(df,6,fm='minres',rotate='oblimin')$loadings,cut=.2)

Loadings:
    MR2     MR1     MR3     MR5     MR4     MR6   
A1                  -0.558                  0.278
A2                  0.690   
A3                  0.619   
A4                  0.392   
A5          -0.207          0.451           0.208
C1                  0.548  
C2                  0.681  
C3                  0.551  
C4                  0.632                   0.300
C5                  0.540                     
E1          0.586   
E2          0.686   
E3          0.349                   0.391   0.207
E4          -0.551                          0.288
E5          -0.405  0.264           0.224       
N1  0.850                                   
N2  0.850                                   
N3  0.640                                   
N4  0.390   0.436                            
N5  0.403   0.255                           0.202
O1                                  0.563       
O2                                  -0.367  0.352
O3                                  0.656       
O4          0.354                   0.375 
O5                                  -0.451  0.400

                 MR2   MR1   MR3   MR5   MR4   MR6
SS loadings    2.305 1.973 1.925 1.700 1.566 0.777
Proportion Var 0.092 0.079 0.077 0.068 0.063 0.031
Cumulative Var 0.092 0.171 0.248 0.316 0.379 0.410

Always be sure to look at the last factor—in this case, no item loads most strongly on the last factor (MR6), which suggests that it is unnecessary. Thus we move to a 5-factor solution:

> fa(df,5,fm='minres',rotate='oblimin')

Factor Analysis using method =  minres
Call: fa(r = df, nfactors = 5, rotate = "oblimin", fm = "minres")
Standardized loadings (pattern matrix) based upon correlation matrix
     MR2   MR3   MR5   MR1   MR4   h2   u2 com
A1  0.20  0.04 -0.36 -0.14 -0.04 0.15 0.85 2.0
A2 -0.02  0.09  0.60  0.01  0.03 0.40 0.60 1.1
A3 -0.03  0.03  0.67 -0.07  0.04 0.51 0.49 1.0
A4 -0.06  0.20  0.46 -0.04 -0.15 0.29 0.71 1.7
A5 -0.14  0.00  0.58 -0.17  0.06 0.48 0.52 1.3
C1  0.06  0.53  0.00  0.05  0.16 0.32 0.68 1.2
C2  0.13  0.64  0.11  0.13  0.06 0.43 0.57 1.2
C3  0.04  0.56  0.11  0.08 -0.06 0.32 0.68 1.1
C4  0.12 -0.64  0.06  0.04 -0.03 0.47 0.53 1.1
C5  0.14 -0.57  0.01  0.16  0.10 0.43 0.57 1.4
E1 -0.09  0.10 -0.10  0.56 -0.11 0.37 0.63 1.3
E2  0.06 -0.03 -0.09  0.67 -0.07 0.55 0.45 1.1
E3  0.06 -0.02  0.30 -0.34  0.31 0.44 0.56 3.0
E4  0.00  0.01  0.36 -0.53 -0.05 0.52 0.48 1.8
E5  0.18  0.27  0.08 -0.39  0.22 0.40 0.60 3.1
N1  0.85  0.01 -0.09 -0.09 -0.05 0.71 0.29 1.1
N2  0.82  0.02 -0.08 -0.04  0.01 0.66 0.34 1.0
N3  0.67 -0.06  0.10  0.14  0.03 0.53 0.47 1.2
N4  0.41 -0.16  0.09  0.42  0.08 0.48 0.52 2.4
N5  0.44 -0.02  0.22  0.25 -0.14 0.34 0.66 2.4
O1 -0.01  0.06  0.02 -0.06  0.53 0.32 0.68 1.1
O2  0.16 -0.10  0.21 -0.03 -0.44 0.24 0.76 1.9
O3  0.01  0.00  0.09 -0.10  0.63 0.47 0.53 1.1
O4  0.08 -0.04  0.14  0.36  0.38 0.26 0.74 2.4
O5  0.11 -0.05  0.10 -0.07 -0.52 0.27 0.73 1.2

                       MR2  MR3  MR5  MR1  MR4
SS loadings           2.49 2.05 2.10 2.07 1.64
Proportion Var        0.10 0.08 0.08 0.08 0.07
Cumulative Var        0.10 0.18 0.27 0.35 0.41
Proportion Explained  0.24 0.20 0.20 0.20 0.16
Cumulative Proportion 0.24 0.44 0.64 0.84 1.00

 With factor correlations of 
      MR2   MR3   MR5   MR1   MR4
MR2  1.00 -0.21 -0.03  0.23 -0.01
MR3 -0.21  1.00  0.20 -0.22  0.20
MR5 -0.03  0.20  1.00 -0.31  0.23
MR1  0.23 -0.22 -0.31  1.00 -0.17
MR4 -0.01  0.20  0.23 -0.17  1.00

Mean item complexity =  1.6
Test of the hypothesis that 5 factors are sufficient.

The degrees of freedom for the null model are  300  and the objective function was  7.23 with Chi Square of  20163.79
The degrees of freedom for the model are 185  and the objective function was  0.63 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.04 

The harmonic number of observations is  2762 with the empirical chi square  1474.6  with prob <  1.3e-199 
The total number of observations was  2800  with MLE Chi Square =  1749.88  with prob <  1.4e-252 

Tucker Lewis Index of factoring reliability =  0.872
RMSEA index =  0.055  and the 90 % confidence intervals are  0.053 0.057
BIC =  281.47
Fit based upon off diagonal values = 0.98
Measures of factor score adequacy             
                                                MR2  MR3  MR5  MR1  MR4
Correlation of scores with factors             0.93 0.88 0.88 0.88 0.85
Multiple R square of scores with factors       0.86 0.77 0.78 0.78 0.72
Minimum correlation of possible factor scores  0.73 0.54 0.56 0.56 0.44

These metrics tell us that the solution is certainly not terrible, so on to the factor inspection:

> print(fa(df,5,fm='minres',rotate='oblimin')$loadings,cut=.2)

Loadings:
        MR2     MR3     MR5     MR1     MR4 
A1      0.204           -0.360          
A2                      0.603           
A3                      0.668           
A4                      0.456           
A5                      0.577           
C1              0.532                    
C2              0.637                    
C3              0.564                    
C4              -0.643                   
C5              -0.571                   
E1                              0.565    
E2                              0.667    
E3                      0.303   -0.342  0.315
E4                      0.362   -0.527   
E5                      0.274   -0.394  0.223
N1  0.852                           
N2  0.817                           
N3  0.665                           
N4  0.413               0.420    
N5  0.439               0.223   0.247    
O1                                      0.534
O2                      0.211           -0.441
O3                                      0.633
O4                              0.357   0.378
O5                                      -0.522

                 MR2   MR3   MR5   MR1   MR4
SS loadings    2.412 1.928 1.922 1.839 1.563
Proportion Var 0.096 0.077 0.077 0.074 0.063
Cumulative Var 0.096 0.174 0.250 0.324 0.387

Right off the bat, this loading table looks a lot cleaner – items clearly load on one predominant factor, and the items seem to be magically grouped by letter. Spoiler: it was this very kind of analysis that originally led psychometricians and personality researchers to conclude that there are five major dimensions to interpersonal differences: Agreeableness, Conscientiousness, Extraversion, Neuroticism (sometimes scored in reverse as emotional stability), and Openness. Each one of those terms has a precise technical definition that usually differs from how you might use the words in conversation. But that is a whole different story.

Now comes the most rewarding part of factor analysis* – figuring out a concise name for each factor, or construct, that can explain how and why people made the choices they did. This is much harder to do, by the way, if you have extracted the wrong number of factors for the data.

*Funny, only psychometricians seem to think that factor analysis is ‘rewarding’.
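One aid for this naming step (a suggestion on my part, using another psych function): fa.diagram() draws which items load on which factor, which makes the groupings easier to see at a glance while you hunt for labels.

solution <- fa(df, 5, fm = 'minres', rotate = 'oblimin')
fa.diagram(solution)   # path-style diagram of the item-to-factor loadings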

This has really only been the tip of the iceberg — there is much more complexity involved in special kinds of rotation and factor extraction, bifactor solutions… Don’t even get me started on factor scores. Factor analysis can be a powerful technique, and is a great way of interpreting user behavior or opinions. The most important takeaway from this approach is that factor analysis lays bare the number of choices a researcher must make when utilizing statistical tools, and the number of choices is directly proportional to the number of opportunities for your brain to project itself onto your data. Other techniques that seem simpler have merely made these choices behind the scenes. None, however, have the storied history of factor analysis.


Evan Warfel is the founder and director of Delphy Research as well as the head of Science and Product at Life Partner Labs. He is available for consulting.
