Is averaged pairwise correlation the first principle to construct MCCA indices? #190

Open
WantongLi123 opened this issue Dec 27, 2023 · 6 comments

Comments

@WantongLi123

I have 3 data views, each with more than 10 variables, and I am using the MCCA function.

I checked the averaged pairwise correlation and the explained covariance of the constructed indices. The 2nd pair of indices has the largest explained covariance, and the 3rd pair of indices has the largest averaged pairwise correlation.

I'm wondering: in principle, shouldn't CCA give the largest averaged pairwise correlation and explained covariance in the 1st pair? Do you know why that is not the case here? Thanks in advance!

Bests,
Wantong


I also posted this question here:
https://stackoverflow.com/posts/77723778/edit
But I don't have enough reputation points to create a cca-zoo tag. Sorry for cross-posting; I was afraid my previous post would not notify you.

@jameschapman19
Owner

Trying to understand this question but struggling a bit (perhaps a language barrier or perhaps terminology).

In principle, the first dimensions of the learnt representations of X, Y, Z (call them Zx, Zy, Zz) will have the highest average pairwise correlation, i.e. the average of the correlations of Zx[:,0] with Zy[:,0], Zx[:,0] with Zz[:,0], and Zy[:,0] with Zz[:,0]. The second dimensions will have the second highest, i.e. the average of the correlations of Zx[:,1] with Zy[:,1], Zx[:,1] with Zz[:,1], and Zy[:,1] with Zz[:,1], and so on.
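For anyone wanting to check this on their own transformed data, here is a minimal NumPy sketch of the quantity described above. The function name `average_pairwise_correlations` and the synthetic scores are assumptions for illustration only; this is not the package's own implementation.

```python
import itertools

import numpy as np


def average_pairwise_correlations(scores):
    """Average pairwise correlation per dimension.

    scores: list of (n_samples, k) arrays, one per view (e.g. Zx, Zy, Zz).
    Returns an array of length k.
    """
    k = scores[0].shape[1]
    view_pairs = list(itertools.combinations(range(len(scores)), 2))
    result = np.zeros(k)
    for d in range(k):
        # Pearson correlation of dimension d between every pair of views
        corrs = [
            np.corrcoef(scores[i][:, d], scores[j][:, d])[0, 1]
            for i, j in view_pairs
        ]
        result[d] = np.mean(corrs)
    return result


# Synthetic stand-in for transformed views: dimension 0 shares a strong
# signal across views, dimension 1 a much noisier one.
rng = np.random.default_rng(0)
n = 500
shared = rng.standard_normal((n, 2))
noise_scale = np.array([0.1, 2.0])
scores = [shared + rng.standard_normal((n, 2)) * noise_scale for _ in range(3)]

apc = average_pairwise_correlations(scores)
print(apc)
```

With real data, `scores` would be the in-sample transformed views, and the values would be expected to decrease with the dimension index.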

The package tests for that principle. If that isn't the case for your data, let me know and, if possible, share the data; I'd be happy (and curious) to see what's going on.

Explained covariance is something I introduced here to understand the nature of correlated signals. High correlation + high covariance will generally be more robust than high correlation + low covariance.
But (M)CCA optimises for correlation, not covariance, so the first dimension may have a higher correlation and a lower covariance than the second dimension. In test data (out of sample), it might even be the case that the first dimension has a lower correlation.

@jameschapman19
Owner

I'd be most immediately surprised if the first dimension in your training data did not have the highest average correlation. The other observations with respect to explained covariance are entirely possible because a signal can be small but highly correlated versus big and less perfectly correlated.
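To make the "small but highly correlated versus big and less perfectly correlated" point concrete, here is a toy NumPy sketch. The signals and scales are made up for illustration, and it compares plain correlation with plain covariance on raw signals; it is not how the package computes explained covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Small-amplitude signal shared almost perfectly between two variables
s_small = rng.standard_normal(n)
a_small = 0.1 * s_small + 0.005 * rng.standard_normal(n)
b_small = 0.1 * s_small + 0.005 * rng.standard_normal(n)

# Large-amplitude signal shared less perfectly (more noise on top)
s_big = rng.standard_normal(n)
a_big = 3.0 * s_big + 2.0 * rng.standard_normal(n)
b_big = 3.0 * s_big + 2.0 * rng.standard_normal(n)

corr_small = np.corrcoef(a_small, b_small)[0, 1]
corr_big = np.corrcoef(a_big, b_big)[0, 1]
cov_small = np.cov(a_small, b_small)[0, 1]
cov_big = np.cov(a_big, b_big)[0, 1]

print(f"small signal: corr={corr_small:.3f}, cov={cov_small:.4f}")
print(f"big signal:   corr={corr_big:.3f}, cov={cov_big:.4f}")
```

The small pair wins on correlation while the big pair wins on covariance, which is why the two rankings do not have to agree.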

@WantongLi123
Author

Hi James, thanks for your reply, and for helping me confirm that the highest average pairwise correlation is the first principle for constructing CCA indices.

I just tested removing a variable from the X group, because this variable is generated using the same model framework as another variable in the Y group. Now the pairwise correlations of the CCA indices make sense again (the first pair has the highest correlation).

I guess the high dependence between two variables in two different groups could somehow bias the CCA algorithm?

@jameschapman19
Owner

jameschapman19 commented Dec 28, 2023

Yeah, I don't think it's biasing the algorithm so much as perhaps making some part of the solving process unstable.

The default behaviour is to apply PCA to the data, run MCCA by solving a generalized eigenvalue problem on the principal components, and then "undo" the PCA (this process has some nice properties and is mathematically equivalent).
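As a rough sketch of that pipeline (an assumption-laden illustration in plain NumPy, not the package's code): one common way to write MCCA is as the generalized eigenvalue problem `A w = lambda B w`, where `A` is the covariance of the concatenated views and `B` keeps only its within-view blocks. The helper names `mcca_gep` and `pca_then_mcca`, the `eps` ridge term, and the synthetic data are all hypothetical.

```python
import numpy as np


def mcca_gep(views, n_components=2, eps=1e-8):
    """Solve A w = lambda B w and return eigenvalues and per-view weights."""
    views = [v - v.mean(axis=0) for v in views]
    n = views[0].shape[0]
    X = np.hstack(views)
    A = X.T @ X / (n - 1)  # all within- and between-view covariances
    B = np.zeros_like(A)   # within-view blocks only
    slices, start = [], 0
    for v in views:
        stop = start + v.shape[1]
        B[start:stop, start:stop] = A[start:stop, start:stop]
        slices.append(slice(start, stop))
        start = stop
    B += eps * np.eye(B.shape[0])  # small ridge for numerical stability
    # Reduce to a standard symmetric problem using B = L L^T
    L_inv = np.linalg.inv(np.linalg.cholesky(B))
    eigvals, eigvecs = np.linalg.eigh(L_inv @ A @ L_inv.T)
    order = np.argsort(eigvals)[::-1][:n_components]
    W = L_inv.T @ eigvecs[:, order]
    return eigvals[order], [W[s] for s in slices]


def pca_then_mcca(views, n_components=2):
    """Apply PCA per view, run MCCA on the components, then undo the PCA."""
    centered = [v - v.mean(axis=0) for v in views]
    pcs, rotations = [], []
    for v in centered:
        U, S, Vt = np.linalg.svd(v, full_matrices=False)
        pcs.append(U * S)       # principal component scores
        rotations.append(Vt.T)  # maps PC-space weights back to variables
    eigvals, w_pc = mcca_gep(pcs, n_components)
    return eigvals, [R @ w for R, w in zip(rotations, w_pc)]


# Synthetic three-view data sharing one latent signal
rng = np.random.default_rng(0)
n = 300
shared = rng.standard_normal((n, 1))
views = [
    shared @ rng.standard_normal((1, p)) + rng.standard_normal((n, p))
    for p in (4, 5, 6)
]

eig_direct, w_direct = mcca_gep(views)
eig_pca, w_pca = pca_then_mcca(views)
print(np.allclose(eig_direct, eig_pca, atol=1e-6))
```

On well-conditioned data like this the two routes should agree; with nearly redundant variables the conditioning of `B` gets worse, which is one place a solver could become unstable.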

It is possible that running MCCA(pca=False) on your original variables might also make the first dimension have the highest correlation, not for any particular reason, just because the solver might like it better.

@jameschapman19
Owner

Also, just checking that we are referring to training/in-sample correlations as opposed to testing/out-of-sample correlations.

Anything is possible out of sample!

@WantongLi123
Author

:) thanks for the reminder!


2 participants