Is averaged pairwise correlation the first principle to construct MCCA indices? #190
Comments
I'm trying to understand this question but struggling a bit (perhaps a language barrier, or perhaps terminology). In principle, the first dimensions of the learnt representations of X, Y, Z (call them Zx, Zy, Zz) will have the highest average pairwise correlation (i.e. the average of the correlations of Zx[:,0] with Zy[:,0], Zx[:,0] with Zz[:,0], and Zy[:,0] with Zz[:,0]), the second dimensions will have the second highest (i.e. the average of the correlations of Zx[:,1] with Zy[:,1], Zx[:,1] with Zz[:,1], and Zy[:,1] with Zz[:,1]), etc. The package tests for that principle. If that isn't the case for your data, let me know and, if possible, share the data; I'd be happy/curious to see what's going on. Explained covariance is something I introduced here to understand the nature of correlated signals. High correlation + high covariance will generally be more robust than high correlation + low covariance.
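For readers who want to check this on their own output, here is a minimal NumPy sketch of the quantity described above. The helper name `average_pairwise_correlation` and the synthetic Zx, Zy, Zz are assumptions made for illustration; they are stand-ins for the transformed views and are not produced by the package.

```python
from itertools import combinations

import numpy as np


def average_pairwise_correlation(views):
    """Mean correlation over all view pairs, computed per latent dimension.

    views: list of arrays of shape (n_samples, n_dims), e.g. [Zx, Zy, Zz].
    Returns an array of shape (n_dims,).
    """
    n_dims = views[0].shape[1]
    pairs = list(combinations(range(len(views)), 2))
    avg = np.zeros(n_dims)
    for k in range(n_dims):
        corrs = [np.corrcoef(views[i][:, k], views[j][:, k])[0, 1] for i, j in pairs]
        avg[k] = np.mean(corrs)
    return avg


# Synthetic stand-ins (assumption): dimension 0 is shared strongly across
# the three views, dimension 1 is shared only weakly.
rng = np.random.default_rng(0)
n = 500
shared = rng.standard_normal((n, 2))
noise_scale = np.array([0.1, 1.0])
Zx, Zy, Zz = (shared + noise_scale * rng.standard_normal((n, 2)) for _ in range(3))

avg_corr = average_pairwise_correlation([Zx, Zy, Zz])
print(avg_corr)  # one value per dimension; the first should be the larger one
```

With real output, Zx, Zy, Zz would be the transformed training views, and the expectation stated above is that `avg_corr` is non-increasing across dimensions.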
I'd be most surprised if the first dimension in your training data did not have the highest average correlation. The other observations about explained covariance are entirely possible, because a signal can be small but highly correlated, versus big and less perfectly correlated.
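A toy illustration of that last point, using plain correlation and covariance between two signals. This is only a sketch with made-up data, and it uses ordinary covariance rather than the package's explained-covariance metric.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Small-amplitude pair: tiny variance, but almost perfectly correlated.
s = rng.standard_normal(n)
small_a = 0.1 * (s + 0.1 * rng.standard_normal(n))
small_b = 0.1 * (s + 0.1 * rng.standard_normal(n))

# Large-amplitude pair: big variance, but much noisier agreement.
t = rng.standard_normal(n)
big_a = 5.0 * (t + 1.0 * rng.standard_normal(n))
big_b = 5.0 * (t + 1.0 * rng.standard_normal(n))


def corr_and_cov(a, b):
    """Return (Pearson correlation, sample covariance) of two 1-D signals."""
    return np.corrcoef(a, b)[0, 1], np.cov(a, b)[0, 1]


small_corr, small_cov = corr_and_cov(small_a, small_b)
big_corr, big_cov = corr_and_cov(big_a, big_b)
print(f"small signal: corr={small_corr:.2f}, cov={small_cov:.3f}")
print(f"big signal:   corr={big_corr:.2f}, cov={big_cov:.3f}")
```

The small pair wins on correlation while the big pair wins on covariance, so ranking dimensions by one measure need not agree with ranking them by the other.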
Hi James, thanks for your reply, and for helping me confirm that the highest average pairwise correlation is the first principle for constructing CCA indices. I just tested removing a variable from the X group, because this variable is generated using the same model framework as another variable in the Y group. Now the pairwise correlations of the CCA indices make sense again (the first pair has the highest correlation). I guess the high dependence between two variables in two groups could somehow bias the CCA algorithm?
Yeah, I don't think it's biasing the algorithm so much as it perhaps makes some part of the solving process unstable. The default behaviour is to apply PCA to the data, run MCCA by solving a generalized eigenvalue problem on the principal components, and then "undo" the PCA (this process has some nice properties and is mathematically equivalent). It is possible that running MCCA(pca=False) on your original variables might also make the first dimension have the highest correlation, not for any particular reason, just because the solver might like it better.
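To make the "PCA, solve, undo" description concrete, here is a rough NumPy-only sketch. It is not the package's code: the helper names `mcca_eig` and `pca_rotate` are hypothetical, the particular generalized eigenproblem (full covariance of the stacked views against its block-diagonal part) is one common MCCA formulation assumed here, and no regularisation or component truncation is applied.

```python
import numpy as np


def mcca_eig(views):
    """Solve A w = lambda * B w, where A is the covariance of the stacked
    views and B keeps only its within-view (block-diagonal) part."""
    views = [v - v.mean(axis=0) for v in views]
    n = views[0].shape[0]
    dims = [v.shape[1] for v in views]
    X = np.hstack(views)
    A = X.T @ X / n
    B = np.zeros_like(A)
    start = 0
    for d in dims:
        block = slice(start, start + d)
        B[block, block] = A[block, block]
        start += d
    # Whiten with the Cholesky factor of B so a symmetric solver can be used.
    L = np.linalg.cholesky(B)
    L_inv = np.linalg.inv(L)
    evals, evecs = np.linalg.eigh(L_inv @ A @ L_inv.T)
    order = np.argsort(evals)[::-1]  # largest eigenvalue first
    W = L_inv.T @ evecs[:, order]
    return evals[order], np.split(W, np.cumsum(dims)[:-1], axis=0)


def pca_rotate(view):
    """Full-rank PCA: return the principal components and the rotation V."""
    centered = view - view.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt.T, Vt.T


# Synthetic three-view data sharing one latent signal (assumption).
rng = np.random.default_rng(0)
n = 300
latent = rng.standard_normal((n, 1))
views = [latent @ rng.standard_normal((1, p)) + rng.standard_normal((n, p))
         for p in (4, 3, 5)]

# Route 1: solve directly on the original variables.
evals_raw, W_raw = mcca_eig(views)

# Route 2: PCA each view, solve on the components, then undo the rotation.
rotated = [pca_rotate(v) for v in views]
evals_pca, W_pc = mcca_eig([pcs for pcs, _ in rotated])
W_pca = [V @ w for (_, V), w in zip(rotated, W_pc)]

print(np.allclose(evals_raw, evals_pca))
```

In exact arithmetic the two routes give the same eigenvalues and, up to sign, the same projections. With nearly collinear variables the matrices become ill-conditioned, which is one way a solver could end up returning dimensions in an unexpected order.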
Also, just checking that we are referring to training/in-sample correlations as opposed to testing/out-of-sample correlations. Anything is possible out of sample!
:) thanks for the reminder!
I have 3 data views, each with more than 10 variables. I use the MCCA function.
I checked the averaged pairwise correlation and explained covariance of the constructed indices. The 2nd pair of indices has the largest explained covariance, and the 3rd pair of indices has the largest averaged pairwise correlation.
In principle, I would expect CCA to put the largest averaged pairwise correlation and explained covariance in the 1st pair. Do you know why that is not the case here? Thanks in advance!
Best,
Wantong
I also posted this question here:
https://stackoverflow.com/posts/77723778/edit
But I don't have enough credit points to create a cca-zoo tag. Sorry for cross-posting, but I'm afraid my previous post did not notify you.