…and easily interpretable. There are two alternative interpretations. The first is the connection to CCA discussed above. The second is that it extends the standard practice of standardizing the mean and variance of each variable separately before dimensionality reduction: now each data source is standardized instead of each variable. We have developed a practical software tool for R that incorporates the subtle but crucial choices that need to be made to choose the dimensionality of the solution. The method is demonstrated on three collections of gene expression measurements. A kernelized version of CCA (KCCA) has been used in specific data fusion tasks (see e.g. [6,7]), and it could easily be extended to be used in the same way as the linear CCA here. We focus on the linear mappings for two practical reasons: computation of the linear version is fast, and the components are more easily interpretable. In particular, the kernelized version does not reveal which of the original features cause the dependencies between the sources.

Results and Discussion

Algorithm

In this section we first explain a simple two-step procedure, based on whitening and PCA, for finding the aspects shared by the sources, and then show how the same fusion solution can equivalently be derived from the result of applying a generalized CCA to the collection. The two-step procedure provides the intuition for the approach: first remove the within-data variation, and then capture all the variation that still remains. The connection to CCA then demonstrates how the procedure solves the issue of combining the separate components CCA gives.

Denote a collection of p data sets by X1,…,Xp, where each Xi is an m × ni matrix such that m ≫ N, with N = Σi ni. The rows of the matrices correspond to the same object in each set, while the columns correspond to features that need not be the same across the data sets. For example, in traditional expression analyses the rows would be genes and the columns would be conditions, treatments, time points, etc. For notational simplicity, we assume zero-mean data.

In the first step, each data set is whitened to remove all within-data correlations, and the data are scaled so that all dimensions have equal variance; the whitened version of each source is obtained by transforming it with the inverse square root of its covariance matrix.

The second step is PCA of the column-wise concatenated whitened data sets. Since all the within-data structure PCA could extract has been removed, it can only find variation shared by at least two of the data sets, and the maximum-variance directions it searches for correspond to the highest between-data correlations. Formally, applying PCA to the column-wise concatenation of the whitened data sets Z = [X1,…,Xp] yields the factorization

C_Z = V Λ V^T,    (1)

where the orthonormal matrix V contains the eigenvectors, Λ is a diagonal matrix of projection variances, and C_Z is the covariance matrix of Z. Projecting Z onto the first d eigenvectors V_d, corresponding to the d largest eigenvalues, gives the d principal components, which are the optimal d-dimensional representation in terms of the shared variance. The whole data collection becomes integrated into

P_d = Z V_d,    (2)

where P_d is of size m × d and contains a d-dimensional feature vector for each of the analyzed objects. The idea is then simply to use this new representation for any further analysis, which can be carried out with any method that operates on vectorial data.
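The two-step procedure is straightforward to sketch in R. The following is a minimal illustration under the assumptions stated above (zero-mean sources with matching rows); it is not the software tool mentioned earlier, and the function names whiten and fuse_shared are our own.

whiten <- function(X) {
  # X: zero-mean m x n_i matrix with m >> n_i (objects in rows, features in columns).
  C <- crossprod(X) / (nrow(X) - 1)          # within-source covariance
  e <- eigen(C, symmetric = TRUE)
  keep <- e$values > 1e-10 * e$values[1]     # drop numerically zero directions
  Q <- e$vectors[, keep, drop = FALSE]
  # C^{-1/2} = Q diag(lambda^{-1/2}) Q^T: removes within-source correlations
  # and gives all retained directions equal variance.
  X %*% (Q %*% diag(1 / sqrt(e$values[keep]), nrow = sum(keep)) %*% t(Q))
}

fuse_shared <- function(data.sets, d) {
  # Step 1: whiten each source separately (removes within-source structure).
  whitened <- lapply(data.sets, whiten)
  # Step 2: PCA of the column-wise concatenation Z = [X1,...,Xp];
  # only variation shared between sources remains to be found.
  Z <- do.call(cbind, whitened)
  V <- prcomp(Z, center = FALSE)$rotation    # eigenvectors of C_Z, cf. equation (1)
  Z %*% V[, seq_len(d), drop = FALSE]        # P_d = Z V_d, cf. equation (2)
}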
The whole procedure can thus be summarized as a single preprocessing step: whiten each source separately, concatenate the whitened sources, and project onto the leading principal components of the combined data.
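As a hypothetical usage example of the sketch above, with three simulated sources that share a one-dimensional signal (dimensions chosen arbitrarily):

# Toy data: 1000 objects, three sources of 5, 8 and 3 features sharing one signal.
set.seed(1)
m <- 1000
shared <- rnorm(m)
sources <- lapply(c(5, 8, 3), function(n)
  scale(shared %o% rnorm(n) + matrix(rnorm(m * n), m, n), scale = FALSE))
P2 <- fuse_shared(sources, d = 2)
dim(P2)   # 1000 x 2: one fused feature vector per object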