{"title": "A Block-Coordinate Descent Approach for Large-scale Sparse Inverse Covariance Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 927, "page_last": 935, "abstract": "The sparse inverse covariance estimation problem arises in many statistical applications in machine learning and signal processing. In this problem, the inverse of a covariance matrix of a multivariate normal distribution is estimated, assuming that it is sparse. An $\\ell_1$ regularized log-determinant optimization problem is typically solved to approximate such matrices. Because of memory limitations, most existing algorithms are unable to handle large scale instances of this problem. In this paper we present a new block-coordinate descent approach for solving the problem for large-scale data sets. Our method treats the sought matrix block-by-block using quadratic approximations, and we show that this approach has advantages over existing methods in several aspects. Numerical experiments on both synthetic and real gene expression data demonstrate that our approach outperforms the existing state of the art methods, especially for large-scale problems.", "full_text": "A Block-Coordinate Descent Approach for\n\nLarge-scale Sparse Inverse Covariance Estimation\n\nEran Treister\u2217\u2020\n\nComputer Science, Technion, Israel\nand Earth and Ocean Sciences, UBC\nVancouver, BC, V6T 1Z2, Canada\neran@cs.technion.ac.il\n\nJavier Turek\u2217\n\nDepartment of Computer Science\n\nTechnion, Israel Institute of Technology\n\nTechnion City, Haifa 32000, Israel\njaviert@cs.technion.ac.il\n\nAbstract\n\nThe sparse inverse covariance estimation problem arises in many statistical appli-\ncations in machine learning and signal processing. In this problem, the inverse of a\ncovariance matrix of a multivariate normal distribution is estimated, assuming that\nit is sparse. 
An ℓ1 regularized log-determinant optimization problem is typically solved to approximate such matrices. Because of memory limitations, most existing algorithms are unable to handle large-scale instances of this problem. In this paper we present a new block-coordinate descent approach for solving the problem for large-scale data sets. Our method treats the sought matrix block-by-block using quadratic approximations, and we show that this approach has advantages over existing methods in several aspects. Numerical experiments on both synthetic and real gene expression data demonstrate that our approach outperforms the existing state-of-the-art methods, especially for large-scale problems.

1 Introduction

The multivariate Gaussian (Normal) distribution is ubiquitous in statistical applications in machine learning, signal processing, computational biology, and others. Usually, normally distributed random vectors are denoted by x ∼ N(µ, Σ) ∈ R^n, where µ ∈ R^n is the mean, and Σ ∈ R^{n×n} is the covariance matrix. Given a set of realizations {x_i}_{i=1}^m, many such applications require estimating the mean µ, and either the covariance Σ or its inverse Σ^{-1}, which is also called the precision matrix. Estimating the inverse of the covariance matrix is useful in many applications [2] as it represents the underlying graph of a Gaussian Markov Random Field (GMRF). Given the samples {x_i}_{i=1}^m, both the mean vector µ and the covariance matrix Σ are often approximated using the standard maximum likelihood estimator (MLE), which leads to µ̂ = (1/m) Σ_{i=1}^m x_i and¹

    S ≜ Σ̂_MLE = (1/m) Σ_{i=1}^m (x_i − µ̂)(x_i − µ̂)^T,    (1)

which is also called the empirical covariance matrix.
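As a concrete illustration, the estimator (1) is a few lines of NumPy; this is a minimal sketch (the function name and the samples-as-rows layout are our own choices, not from the paper):

```python
import numpy as np

def empirical_covariance(X):
    """MLE estimates of the mean and covariance from samples.

    X is an (m, n) array whose rows are the samples x_i. Returns
    (mu_hat, S) with S = (1/m) * sum_i (x_i - mu_hat)(x_i - mu_hat)^T,
    i.e. Equation (1) with the biased 1/m normalization.
    """
    m = X.shape[0]
    mu_hat = X.mean(axis=0)
    Xc = X - mu_hat            # center the samples around the MLE mean
    S = (Xc.T @ Xc) / m        # biased MLE normalization (1/m, not 1/(m-1))
    return mu_hat, S
```

For the unbiased variant mentioned in footnote 1, one would divide by m − 1 instead of m.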
Specifically, according to the MLE, Σ^{-1} is estimated by solving the optimization problem

    min_{A≻0} f(A) ≜ min_{A≻0} −log(det A) + tr(SA),    (2)

which is obtained by applying −log to the probability density function of the Normal distribution.

∗ The authors contributed equally to this work.
† Eran Treister is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship.
¹ Equation (1) is the standard MLE estimator. However, sometimes the unbiased estimator is preferred, where m − 1 replaces m in the denominator.

However, if the number of samples is lower than the dimension of the vectors, i.e., m < n, then S in (1) is rank deficient and not invertible, whereas the true Σ is assumed to be positive definite, hence full-rank. Still, when m < n one can estimate the matrix by adding further assumptions. It is well-known [5] that if (Σ^{-1})_{ij} = 0 then the random scalar variables in the i-th and j-th entries of x are conditionally independent. Therefore, in this work we adopt the notion of estimating the inverse of the covariance, Σ^{-1}, assuming that it is sparse. (Note that in most cases Σ is dense.) For this purpose, we follow [2, 3, 4], and minimize (2) with a sparsity-promoting ℓ1 prior:

    min_{A≻0} F(A) ≜ min_{A≻0} f(A) + λ‖A‖₁.    (3)

Here, f(A) is the MLE functional defined in (2), ‖A‖₁ ≡ Σ_{i,j} |a_ij|, and λ > 0 is a regularization parameter that balances between the sparsity of the solution and the fidelity to the data.
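To make the objective concrete, F(A) in (3) can be evaluated densely for small n as follows (a naive sketch of our own; at the scales targeted by the paper neither S nor A^{-1} is ever formed densely):

```python
import numpy as np

def objective(A, S, lam):
    """Evaluate F(A) = -log det(A) + tr(S A) + lam * ||A||_1 from (3).

    Dense sketch for small n. A must be symmetric positive definite;
    the Cholesky factor gives a numerically stable log-determinant.
    """
    L = np.linalg.cholesky(A)                  # raises if A is not PD
    logdet = 2.0 * np.sum(np.log(np.diag(L)))  # log det A = 2 * sum log L_ii
    return -logdet + np.trace(S @ A) + lam * np.abs(A).sum()
```

The Cholesky-based log-determinant doubles as the positive-definiteness check needed throughout the paper: a failed factorization means the candidate iterate left the feasible cone.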
The sparsity assumption corresponds to a small number of statistical dependencies between the variables. Problem (3) is also called Covariance Selection [5], and is convex and non-smooth.

Many methods were recently developed for solving (3); see [3, 4, 7, 8, 10, 11, 12, 15, 16] and references therein. The current state-of-the-art methods, [10, 11, 12, 16], involve a “proximal Newton” approach [20], where a quadratic approximation is applied to the smooth part f(A) in (3), leaving the non-smooth ℓ1 term intact, in order to obtain the Newton descent direction. To obtain this, the gradient and Hessian of f(A) are needed, and are given by

    ∇f(A) = S − A^{-1},    ∇²f(A) = A^{-1} ⊗ A^{-1},    (4)

where ⊗ is the Kronecker product. The gradient in (4) already reveals the main difficulty in solving this problem: it contains A^{-1}, the inverse of the sparse matrix A, which may be dense and expensive to compute. The advantage of the proximal Newton approach for this problem is its low overhead: by calculating A^{-1} for ∇f(A), we also get the Hessian at the same cost [11, 12, 16].

In this work we aim at solving large-scale instances of (3), where n is so large that O(n²) variables cannot fit in memory. Such problem sizes are required in fMRI [11] and gene expression analysis [9] applications, for example. Large values of n introduce limitations: (a) they preclude storing the full matrix S in (1), and allow us to use only the vectors {x_i}_{i=1}^m, which are assumed to fit in memory; (b) while the sparse matrix A in (3) fits in memory, its dense inverse does not. Because of this limitation, most of the methods mentioned above cannot be used to solve (3), as they require computing the full gradient of f(A), which is a dense n × n symmetric matrix.
The same applies to the blocking strategies of [2, 7], which target the dense covariance matrix itself rather than its inverse, using the dual formulation of (3). One exception is the proximal Newton approach in [11], which was made suitable for large-scale matrices by treating the Newton direction problem in blocks.

In this paper, we introduce an iterative Block-Coordinate Descent [20] method for solving large-scale instances of (3). We treat the problem in blocks defined as subsets of columns of A. Each block sub-problem is solved by a quadratic approximation, resulting in a descent direction that corresponds only to the variables in the block. Since we consider one sub-problem at a time, we can fully store the gradient and Hessian for the block. In contrast, [11] applies a blocking approach to the full Newton problem, which results in a sparse n × n descent direction. There, all the columns of A^{-1} are calculated for the gradient and Hessian of the problem at each inner iteration when solving the full Newton problem. Therefore, our method requires fewer calculations of A^{-1} than [11], and this is the most computationally expensive task in both algorithms. Furthermore, our blocking strategy allows an efficient linesearch procedure, whereas [11] requires computing a determinant of a sparse n × n matrix. Although our method has a linear order of convergence, it converges in fewer iterations than [11] in our experiments. Note that the asymptotic convergence of [11] is quadratic only if the exact Newton direction is found at each iteration, which is very costly for large-scale problems.

1.1 Newton’s Method for Covariance Selection

The proximal Newton approach mentioned earlier is iterative, and at each iteration k, the smooth part of the objective in (3) is approximated by a second-order Taylor expansion around the k-th iterate A^(k).
Then, the Newton direction ∆* is the solution of an ℓ1 penalized quadratic minimization problem,

    min_∆ F̃(A^(k) + ∆) = min_∆ f(A^(k)) + tr(∆(S − W)) + (1/2) tr(∆W∆W) + λ‖A^(k) + ∆‖₁,    (5)

where W = (A^(k))^{-1} is the inverse of the k-th iterate. Note that the gradient and Hessian of f(A) in (4) are featured in the second and third terms in (5), respectively, while the first term of (5) is constant and can be ignored. Problem (5) corresponds to the well-known LASSO problem [18], which is popular in machine learning and signal/image processing applications [6]. The methods of [11, 12, 16] apply known LASSO solvers for treating the Newton direction minimization (5).

Once the direction ∆* is computed, it is added to A^(k) employing a linesearch procedure to sufficiently reduce the objective in (3) while ensuring positive definiteness. To this end, the updated iterate is A^(k+1) = A^(k) + α*∆*, and the parameter α* is obtained using Armijo’s rule [1, 12]. That is, we choose an initial value α₀ and a step size 0 < β < 1, and accordingly define αᵢ = βⁱα₀. We then look for the smallest i ∈ N that satisfies the constraint A^(k) + αᵢ∆* ≻ 0, and the condition

    F(A^(k) + αᵢ∆*) ≤ F(A^(k)) + αᵢσ [tr(∆*(S − W)) + λ‖A^(k) + ∆*‖₁ − λ‖A^(k)‖₁].    (6)

The parameters α₀, β, and σ are usually chosen as 1, 0.5, and 10⁻⁴, respectively.

1.2 Restricting the Updates to Active Sets

An additional significant idea of [12] is to restrict the minimization of (5) at each iteration to an “active set” of variables and keep the rest as zeros.
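The backtracking procedure around condition (6) can be sketched generically. Here `F` evaluates the objective (3) and `decrease` is the bracketed term of (6), both supplied by the caller; positive definiteness of each trial point is tested with a Cholesky factorization. This is our own illustrative code, not the authors' implementation:

```python
import numpy as np

def armijo_linesearch(F, A, Delta, decrease, alpha0=1.0, beta=0.5,
                      sigma=1e-4, max_iter=50):
    """Armijo backtracking for the update A + alpha * Delta, as in (6).

    `decrease` is tr(Delta*(S - W)) + lam*||A + Delta||_1 - lam*||A||_1.
    Each trial point must remain positive definite; np.linalg.cholesky
    raises LinAlgError for non-PD matrices, triggering a backtrack.
    """
    FA = F(A)
    alpha = alpha0
    for _ in range(max_iter):
        trial = A + alpha * Delta
        try:
            np.linalg.cholesky(trial)      # enforce A + alpha*Delta > 0
        except np.linalg.LinAlgError:
            alpha *= beta
            continue
        if F(trial) <= FA + alpha * sigma * decrease:
            return alpha                   # sufficient decrease reached
        alpha *= beta
    raise RuntimeError("linesearch failed to find an acceptable step")
```

Note that the same `decrease` quantity is reused for every trial step size, so each backtracking step costs only one objective evaluation plus one factorization attempt.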
The active set of a matrix A is defined as

    Active(A) = {(i, j) : A_ij ≠ 0 ∨ |(S − A^{-1})_ij| > λ}.    (7)

This set comes from the definition of the sub-gradient of (3). In particular, as A^(k) approaches the solution A*, Active(A^(k)) approaches {(i, j) : A*_ij ≠ 0}. As noted in [12, 16], restricting (5) to the variables in Active(A^(k)) reduces the computational complexity: given the matrix W, the Hessian (third) term in (5) can be calculated in O(Kn) operations instead of O(n³), where K = |Active(A^(k))|. Hence, any method for solving the LASSO problem can be utilized to solve (5) effectively while saving computations by restricting its solution to Active(A^(k)). Our experiments have verified that restricting the minimization of (5) to Active(A^(k)) does not significantly increase the number of iterations needed for convergence.

2 Block-Coordinate-Descent for Inverse Covariance (BCD-IC) Estimation

In this section we describe our contribution. To solve problem (3), we apply an iterative Block-Coordinate-Descent approach [20]. At each iteration, we divide the column set {1, ..., n} into blocks. Then we iterate over all blocks, and in turn minimize (3) restricted to the “active” variables of each block, which are determined according to (7). The other matrix entries remain fixed during each update. The matrix A is updated after each block-minimization.

We choose our blocks as sets of columns because the portion of the gradient (4) that corresponds to such blocks can be computed as solutions of linear systems. Because the matrix is symmetric, the corresponding rows are updated simultaneously. Figure 1 shows an example of a BCD iteration where the blocks of columns are chosen in sequential order. In practice, the sets of columns can be non-contiguous and vary between the BCD iterations.
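The active-set test (7), which the per-block variable selection above relies on, is a one-liner for small n; a dense sketch of our own (the paper only ever evaluates the condition on the rows/columns of the current block, precisely to avoid forming A^{-1} in full):

```python
import numpy as np

def active_set(A, S, lam):
    """Boolean mask of Equation (7): an entry is active when A_ij != 0
    or when the gradient magnitude |(S - A^{-1})_ij| exceeds lambda.

    Dense illustration only; computing A^{-1} fully is exactly what the
    block-wise algorithm avoids at large scale.
    """
    G = S - np.linalg.inv(A)       # gradient of f, Equation (4)
    return (A != 0) | (np.abs(G) > lam)
```

Larger λ shrinks the second condition's contribution, so fewer zero entries are allowed to become non-zero in the next update.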
We elaborate later on how to partition the columns, and on some advantages of this block-partitioning. Partitioning the matrix into small blocks enables our method to solve (3) in high dimensions (up to millions of variables), requiring O(n²/p) additional memory, where p is the number of blocks (that is, in addition to the memory needed for storing the iterated solution A^(k) itself).

Figure 1: Example of a BCD iteration. The blocks are treated successively.

2.1 Block Coordinate Descent Iteration

Assume that the set of columns {1, ..., n} is divided into p blocks {I_j}_{j=1}^p, where I_j is the set of indices that corresponds to the columns and rows in the j-th block. As mentioned before, in the BCD-IC algorithm we traverse all blocks and update the iterated solution matrix block by block. We denote the updated matrix after treating the j-th block at iteration k by A_j^(k), and the next iterate A^(k+1) is defined once the last block is treated, i.e., A^(k+1) = A_p^(k).

To treat each block of (3), we adopt both of the ideas described earlier: we use a quadratic approximation to solve each block, while also restricting the updated entries to the active set. For simplicity of notation in this section, let us denote the updated matrix A_{j−1}^(k), before treating block j at iteration k, by Ã. To update block j, we change only the entries in the rows/columns in I_j. First, we form and minimize a quadratic approximation of problem (3), restricted to the rows/columns in I_j:

    min_{∆_j} F̃(Ã + ∆_j),    (8)

where F̃(·) is the quadratic approximation of (3) around Ã, similarly to (5), and ∆_j has non-zero entries only in the rows/columns in I_j. In addition, the non-zeros of ∆_j are restricted to Active(Ã) defined in (7).
That is, we restrict the minimization (8) to

    Active_{I_j}(Ã) = Active(Ã) ∩ {(i, k) : i ∈ I_j ∨ k ∈ I_j},    (9)

while all other elements are set to zero for the entire treatment of the j-th block. To calculate this set, we check the condition in (7) only in the columns and rows of I_j. To define this active set, and to calculate the gradient (4) for block I_j, we first calculate the I_j columns of Ã^{-1}, which is the main computational task of our algorithm. To achieve that, we solve |I_j| linear systems, with the canonical vectors e_l as right-hand sides for each l ∈ I_j, i.e., (Ã^{-1})_{I_j} = Ã^{-1}E_{I_j}. The solution of these linear systems can be achieved in various ways. Direct methods may be applied using the Cholesky factorization, which requires up to O(n³) operations. For large dimensions, iterative methods such as Conjugate Gradients (CG) are usually preferred, because the cost of each iteration is proportional to the number of non-zeros in the sparse matrix. See Section A.4 in the Appendix for details about the computational cost of this part of the algorithm.

2.1.1 Treating a Block-subproblem by Newton’s Method

To get the Newton direction for the j-th block, we solve the LASSO problem (8), for which there are many available solvers [22]. We choose the Polak-Ribiere non-linear Conjugate Gradients (NLCG) method of [19] which, together with a diagonal preconditioner, was used to solve this problem in [22, 19]. We describe the NLCG algorithm in Appendix A.1. To use this method, we need to calculate the objective of (8) and its gradient efficiently.

The calculation of the objective in (8) is much simpler than that of the full version in (5), because only blocks of rows/columns are considered.
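The main computational kernel described above, computing (Ã^{-1})_{I_j} = Ã^{-1}E_{I_j} one canonical right-hand side at a time, might look like the following unpreconditioned CG sketch (dense matvecs for brevity; with a sparse Ã each iteration costs O(nnz(Ã)), which is what makes this step feasible at large scale):

```python
import numpy as np

def inverse_columns(A, idx, tol=1e-5, max_iter=1000):
    """Columns `idx` of A^{-1}, each obtained by solving A w = e_l with
    plain conjugate gradients (A symmetric positive definite)."""
    n = A.shape[0]
    W = np.zeros((n, len(idx)))
    for col, l in enumerate(idx):
        b = np.zeros(n)
        b[l] = 1.0                         # canonical vector e_l
        x = np.zeros(n)
        r = b.copy()                       # residual b - A @ x for x = 0
        p = r.copy()
        rs = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            a = rs / (p @ Ap)
            x += a * p
            r -= a * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol * np.linalg.norm(b):
                break                      # relative residual small enough
            p = r + (rs_new / rs) * p
            rs = rs_new
        W[:, col] = x
    return W
```

The |I_j| solves are independent, so in a practical implementation they would be preconditioned and run in parallel, as the paper's CG tolerances in Section 4 suggest.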
Denoting W = Ã^{-1}, to compute the objective in (8) and its gradient we need to calculate the matrices W∆_jW and S − W only at the entries where ∆_j is non-zero (in the rows/columns in I_j). These matrices are symmetric, and hence only their columns are necessary. This idea applies to the ℓ1 term of the objective in (8) as well.

In each iteration of the NLCG method, the main computational task involves calculating W∆_jW in the columns of I_j. For that, we reuse the I_j columns of Ã^{-1} calculated for obtaining (9), which we denote by W_{I_j}. Since we only need the result in the columns I_j, we first notice that (W∆_jW)_{I_j} = W∆_jW_{I_j}, and the product ∆_jW_{I_j} can be computed efficiently because ∆_j is sparse.

Computing W(∆_jW_{I_j}) is another relatively expensive part of our algorithm, and here we exploit the restriction to the active set. That is, we only need to compute the entries in (9). For this, we follow the idea of [11] and use the rows (or columns) of W that are represented in (9). Besides the columns I_j of W, we also need the “neighborhood” of I_j, defined as

    N_j = {i : ∃k ∉ I_j : (i, k) ∈ Active_{I_j}(A)}.    (10)

The size of this set determines the number of additional columns of W that we need, and therefore we want it to be as small as possible. To achieve that, we define the blocks {I_j} using clustering methods, following [11]. We use METIS [13], but other methods may be used instead. The aim of these methods is to partition the indices of the matrix columns/rows into disjoint subsets of relatively small size, such that there are as few as possible non-zero entries outside the diagonal blocks of the matrix that correspond to each subset. In our notation, we aim for the size of N_j to be as small as possible for every block I_j, while the size of I_j remains small enough.
Note that after we compute W_{N_j}, we need to actually store and use only |N_j| × |N_j| numbers out of W_{N_j}. However, there might be situations where the matrix has a few dense columns, resulting in some sets N_j of size O(n). Computing W_{N_j} for those sets is not possible because of memory limitations. We treat this case separately; see Section A.2 in the Appendix for details. For a discussion of the computational cost of this part, see Section A.4 in the Appendix.

2.1.2 Optimizing the Solution in the Newton Direction with Line-search

Assume that ∆*_j is the Newton direction obtained by solving problem (8). Now we seek to update the iterated matrix A_j^(k) = A_{j−1}^(k) + α*∆*_j, where α* > 0 is obtained by a linesearch procedure similar to Equation (6).

For a general Newton direction matrix ∆* as in (6), this procedure requires calculating the determinant of an n × n matrix. In [11], this is done by solving n − 1 linear systems of decreasing sizes, from n − 1 to 1. However, since our direction ∆*_j has a special block structure, we obtain a significantly cheaper linesearch procedure compared to [11], assuming that the blocks I_j are relatively small. First, the trace and ℓ1 terms that are involved in the objective of (3) can be calculated with respect only to the entries in the columns I_j (the rows are taken into account by symmetry). The log det term, however, needs more special care, and is eventually reduced to calculating the determinant of an |I_j| × |I_j| matrix, which becomes cheaper as the block size decreases. Let us introduce a partitioning of any matrix A into blocks, according to a set of indices I_j ⊆ {1, ..., n}.
Assume without loss of generality that the rows and columns of A have been permuted such that the columns/rows with indices in I_j appear first, and let

    A = [ A11  A12 ]
        [ A21  A22 ]    (11)

be a partitioning of A into four blocks. The sub-matrix A11 corresponds to the elements in rows I_j and columns I_j of Ã. According to the Schur complement [17], for any invertible matrix and block-partitioning as above, the following holds:

    log det(A) = log det(A22) + log det(A11 − A12 A22^{-1} A21).    (12)

In addition, for any symmetric matrix A the following applies:

    A ≻ 0 ⇔ A22 ≻ 0 and A11 − A12 A22^{-1} A21 ≻ 0.    (13)

Using the above notation for Ã and the corresponding partitioning for ∆*_j, we write using (12):

    log det(Ã + α∆_j) = log det(Ã22) + log det(B0 + αB1 + α²B2),    (14)

where B0 = Ã11 − Ã12 Ã22^{-1} Ã21, B1 = ∆11 − ∆12 Ã22^{-1} Ã21 − Ã12 Ã22^{-1} ∆21, and B2 = −∆12 Ã22^{-1} ∆21. (Note that here we replaced ∆*_j by ∆ to ease notation.)

Finally, the positive definiteness condition Ã + α*∆*_j ≻ 0 involved in the linesearch (6) is equivalent to B0 + αB1 + α²B2 ≻ 0, assuming that Ã22 ≻ 0, following (13). Throughout the iterations, we always guarantee that our iterated solution matrix Ã remains positive definite by linesearch in every update. This requires that the initialization of the algorithm, A^(0), be positive definite. If the set I_j is relatively small, then the matrices Bi in (14) are also small (|I_j| × |I_j|), and we can easily compute the objective F(·) and apply the Armijo rule (6) for ∆*_j.
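The identity (14) can be checked numerically. The sketch below forms B0, B1, B2 directly from the block partition (11); for clarity it inverts A22 densely, whereas the paper recovers these matrices from the cached W_{I_j} and W_{N_j} (Appendix A.3), so only the |I_j| × |I_j| determinant depends on α:

```python
import numpy as np

def logdet_update(A, Delta, alpha, idx):
    """log det(A + alpha*Delta) via the Schur identity (12)/(14),
    assuming Delta is nonzero only in the rows/columns listed in idx
    (so its (2,2) block vanishes)."""
    n = A.shape[0]
    rest = np.setdiff1d(np.arange(n), idx)
    A11, A12 = A[np.ix_(idx, idx)], A[np.ix_(idx, rest)]
    A21, A22 = A[np.ix_(rest, idx)], A[np.ix_(rest, rest)]
    D11, D12 = Delta[np.ix_(idx, idx)], Delta[np.ix_(idx, rest)]
    D21 = Delta[np.ix_(rest, idx)]
    A22inv = np.linalg.inv(A22)            # dense for illustration only
    B0 = A11 - A12 @ A22inv @ A21
    B1 = D11 - D12 @ A22inv @ A21 - A12 @ A22inv @ D21
    B2 = -D12 @ A22inv @ D21
    # Only the small block depends on alpha, as in Equation (14).
    return (np.linalg.slogdet(A22)[1]
            + np.linalg.slogdet(B0 + alpha * B1 + alpha**2 * B2)[1])
```

During backtracking, B0, B1, B2 are fixed, so each trial α costs only one small determinant; this is exactly why the linesearch here is so much cheaper than a full n × n determinant.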
Calculating the matrices Bi in (14) may seem expensive; however, as we show in Appendix A.3, they can be obtained from the previously computed matrices W_{I_j} and W_{N_j} mentioned earlier. Therefore, computing (14) can be achieved in O(|I_j|³) time complexity.

Algorithm: BCD-IC(A^(0), {x_i}_{i=1}^m, λ)
for k = 0, 1, 2, ... do
    Calculate clusters of elements {I_j}_{j=1}^p based on A^(k).
    % Denote: A_0^(k) = A^(k)
    for j = 1, ..., p do
        Compute W_{I_j} = ((A_{j−1}^(k))^{-1})_{I_j}. % solve |I_j| linear systems
        Define Active_{I_j}(A_{j−1}^(k)) as in (9), and define the set N_j in (10).
        Compute W_{N_j} = ((A_{j−1}^(k))^{-1})_{N_j}. % solve |N_j| linear systems
        Find the Newton direction ∆*_j by solving the LASSO problem (8).
        Update the solution: A_j^(k) = A_{j−1}^(k) + α*∆*_j by linesearch.
    end
    % Denote: A^(k+1) = A_p^(k)
end

Algorithm 1: Block Coordinate Descent for Inverse Covariance Estimation

3 Convergence Analysis

In this section, we elaborate on the convergence of the BCD-IC algorithm to the global optimum of (3). We base our analysis on [20, 12]. In [20], a general block-coordinate-descent approach is analyzed for minimization problems of the form F(A) = f(A) + λh(A), composed of the sum of a smooth function f(·) and a separable convex function h(·), which in our case are −log det(A) + tr(SA) and ‖A‖₁, respectively. Although this setup fits the functional F(A) in (3), [20] treats the problem in the R^{n×n} domain, while the minimization in (3) is constrained over S^n_{++}, the domain of symmetric positive definite matrices.
To overcome this limitation, the authors in [12] extended the analysis in [20] to treat the specific constrained problem (3).

In particular, [20, 12] consider block-coordinate-descent methods where in each step t a subset J_t of variables is updated. Then, a Gauss-Seidel condition is necessary to ensure that all variables are updated every T steps:

    ∪_{l=0,...,T−1} J_{l+t} ⊇ N,  ∀t = 1, 2, . . . ,    (15)

where N is the set of all variables, and T is a fixed number. Similarly to [12], treating each block of columns I_j in the BCD-IC algorithm is equivalent to updating the elements outside the active set Active_{I_j}(A), followed by an update of the elements in Active_{I_j}(A). Therefore, in (15) we set

    J_{2t} = {(i, l) : i ∈ I_j ∨ l ∈ I_j} \ Active_{I_j}(Ã),    J_{2t+1} = Active_{I_j}(Ã),

where the step index t corresponds to the block j at iteration k of BCD-IC. In [12, Lemma 1], it is shown that setting the elements outside the active set for block j to zero satisfies the optimality condition of that step. Therefore, in our algorithm we only need to update the elements in Active_{I_j}(A). Now, if we were using p fixed blocks containing all the coordinates of A in Algorithm 1 (no clustering is applied), then the Gauss-Seidel condition (15) would be satisfied every T = 2p blocks. When clustering is applied, the block-partitioning {I_j} can change at every activation of the clustering method. Therefore, condition (15) is satisfied after at most T = 4p̃ steps, where p̃ is the maximum number of blocks obtained over all activations of the clustering algorithm. For completeness, we include in Appendix A.5 the lemmas of [12] and the proof of the following theorem:

Theorem 1.
In Algorithm 1, the sequence {A_j^(k)} converges to the global optimum of (3).

4 Numerical Results

In this section we demonstrate the efficiency of the BCD-IC method, and compare it with other methods at both small and large scales. For small-scale problems we include QUIC [12], BIG-QUIC [11] and G-ISTA [8], which are the state-of-the-art methods at this scale. For large-scale problems, we compare our method only with BIG-QUIC, as it is the only feasible method known to us at this scale. For all methods, we use the original code provided by the authors, all implemented in C and parallelized (except QUIC, which is partially parallelized). Our code for BCD-IC is MATLAB based, with several routines in C. All the experiments were run on a machine with 2 Intel Xeon E-2650 2.0GHz processors with 16 cores and 64GB RAM, using Windows 7 OS.

As a stopping criterion for BCD-IC, we use the rule of [11]: ‖grad^S F(A^(k))‖₁ < ε‖A^(k)‖₁, where grad^S F(·) is the minimal norm subgradient, defined in Equation (25) in Appendix A.5. For ε = 10⁻², as we choose, this results in the entries of A^(k) being about two digits accurate compared to the true solution Σ^{-1}*. As in [11], we approximate W_{I_j} and W_{N_j} using CG, which we stop once the residual drops below 10⁻⁵ and 10⁻⁴, respectively. For stopping NLCG (Algorithm 2) we use ε_nlcg = 10⁻⁴ (see details at the end of Section A.1). We note that for the large-scale test problems, BCD-IC with optimal block size requires less memory than BIG-QUIC.

4.1 Synthetic Experiments

We use two synthetic experiments to compare the performance of the methods.
First, we consider the random matrix from [14], which is generated to have about 10 non-zeros per row and to be well-conditioned. We generate matrices of sizes n varying from 5,000 to 160,000, and generate 200 samples for each (m = 200). The values of λ are chosen so that the solution Σ^{-1}* has approximately 10n non-zeros. BCD-IC is run with block sizes of 64, 96, 128, 256, and 256 for each of the random tests in Table 1, respectively. The second problem is a 2D version of the chain example in [14], which can be represented as the 2D stencil

    (1/4) [      −1      ]
          [ −1    5   −1 ]
          [      −1      ]

applied on a square lattice. λ is chosen such that Σ^{-1}* has about 5n non-zeros. For these tests, BCD-IC is run with a block size of 1024.

Table 1 summarizes the results for this test case. The results show that for small-scale problems, G-ISTA is the fastest method and BCD-IC is just behind it. However, from size 20,000 and higher, BCD-IC is the fastest. We could not run QUIC and G-ISTA on problems larger than 20,000 because of memory limitations. The time gap between G-ISTA and both BCD-IC and BIG-QUIC at small scales can be reduced if their programs receive the matrix S as input instead of the {x_i}_{i=1}^m.

4.2 Gene Expression Analysis Experiments

For the large-scale real-world experiments, we use gene expression datasets that are available at the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). We use several of the
For each run, timings are reported\nin seconds and number of iterations in parentheses. \u2018*\u2019 means that the algorithm ran out of memory.\n\nBIG-QUIC\n19.6s(5)\n73.8s(5)\n673s(5)\n2,671s(5)\n16,764s(5)\n25,584s(4)\n40,530s(4)\n203,370s(4)\n1,220,213s(4)\n\nBCD-IC\n15.3s(3)\n61.8s(3)\n265s(3)\n729s(4)\n4,102s(4)\n21,296s(4)\n24,235s(4)\n130,636s(4)\n777,947s(4)\n\n(cid:107)\u03a3\u22121\u2217(cid:107)0\n63,164\n139,708\n311,932\n423,696\n891,268\n1,852,198\n1,553,698\n3,002,338\n5,684,306\n\nQUIC\n28.7s(5)\n114s(5)\n823s(5)\n*\n*\n*\n*\n*\n*\n\n\u03bb\n0.22\n0.23\n0.24\n0.26\n0.27\n0.28\n0.30\n0.31\n0.32\n\n2D 5002\n2D 7082\n2D 10002\n\ntests reported in [9]. The data is preprocessed to have zero mean and unit variance for each variable\n(i.e., diag(S) = I). Table 2 shows the datasets as well as the numbers of variables (n) and samples\n(m) on each. In particular, these datasets have many variables and very few samples (m (cid:28) n).\nBecause of the size of the problems, we ran only BCD-IC and BIG-QUIC for these test cases.\nFor the \ufb01rst three tests in Table 2, \u03bb was chosen so that the solution matrix has about 10n non-zeros.\nFor the fourth test, we choose a relatively high \u03bb = 0.9 since the low number of samples causes the\nsolutions with smaller \u03bb\u2019s to be quite dense. BCD-IC is run with block size of 256 for all the tests\nin Table 2. We found these datasets to be more challenging than the synthetic experiments above.\nStill, both algorithms BCD-IC and BIG-QUIC manage to estimate the inverse covariance matrix in\nreasonable time. As in the synthetic case, BCD-IC outperforms BIG-QUIC in all test cases. BCD-IC\nrequires a smaller number of iterations to converge, which translates into shorter timings. 
Moreover, the average time of each BCD-IC iteration is lower than that of BIG-QUIC.

code name   Description       n        m    λ     ‖Σ⁻¹*‖₀    BIG-QUIC       BCD-IC
GSE1898     Liver cancer      21,794   182  0.7   293,845    5,079.5s (12)  788.3s (7)
GSE20194    Breast cancer     22,283   278  0.7   197,953    2,810.6s (10)  452.9s (8)
GSE17951    Prostate cancer   54,675   154  0.78  558,929    8,229.7s (9)   1,621.9s (6)
GSE14322    Liver cancer      104,702  76   0.9   4,973,476  127,199s (14)  55,314.8s (9)

Table 2: Gene expression results. ‖Σ⁻¹*‖₀ denotes the number of non-zeros in the estimated inverse covariance matrix. For each run, timings are reported in seconds with the number of iterations in parentheses.

5 Conclusions

In this work we introduced a Block-Coordinate Descent method for solving the sparse inverse covariance estimation problem. Our method has a relatively low memory footprint, and is therefore especially attractive for solving large-scale instances of the problem. It solves the problem by iterating over and updating the matrix block by block, where each block is chosen as a subset of columns and the respective rows. For each block sub-problem, a proximal Newton method is applied, requiring the solution of a LASSO problem to find the descent direction. Because the update is limited to a subset of columns and rows, we are able to store the gradient and Hessian for each block, and enjoy an efficient linesearch procedure. Numerical results show that for medium-to-large scale experiments our algorithm is faster than the state-of-the-art methods, especially when the problem is relatively hard.

Acknowledgement: The authors would like to thank Prof. Irad Yavneh for his valuable comments and guidance throughout this work.
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 623212 (MC Multiscale Inversion).

References

[1] L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1-3, 1966.

[2] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.

[3] O. Banerjee, L. El Ghaoui, A. d'Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd ICML, pages 89-96. ACM, 2006.

[4] A. d'Aspremont, O. Banerjee, and L. El Ghaoui. First-order methods for sparse covariance selection. SIAM Journal on Matrix Analysis and Applications, 30(1):56-66, 2008.

[5] A. P. Dempster. Covariance selection. Biometrics, pages 157-175, 1972.

[6] M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

[7] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

[8] D. Guillot, B. Rajaratnam, B. T. Rolfs, A. Maleki, and I. Wong. Iterative thresholding algorithm for sparse inverse covariance estimation. In NIPS 25, Lake Tahoe, CA, 2012.

[9] J. Honorio and T. S. Jaakkola. Inverse covariance estimation for high-dimensional data in linear time and space: Spectral methods for Riccati and sparse models. In Proc. of the 29th Conference on UAI, 2013.

[10] C.-J. Hsieh, I. Dhillon, P. Ravikumar, and A. Banerjee. A divide-and-conquer method for sparse inverse covariance estimation.
In NIPS 25, pages 2339-2347, 2012.

[11] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, P. Ravikumar, and R. Poldrack. BIG & QUIC: Sparse inverse covariance estimation for a million variables. In NIPS 26, pages 3165-3173, 2013.

[12] C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. D. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In NIPS 24, pages 2330-2338, 2011.

[13] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359-392, 1998.

[14] L. Li and K.-C. Toh. An inexact interior point method for ℓ1-regularized sparse covariance selection. Mathematical Programming Computation, 2(3-4):291-315, 2010.

[15] R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 13:781-794, 2012.

[16] P. A. Olsen, F. Öztoprak, J. Nocedal, and S. J. Rennie. Newton-like methods for sparse inverse covariance estimation. In NIPS 25, pages 764-772, 2012.

[17] Y. Saad. Iterative Methods for Sparse Linear Systems, 2nd edition. SIAM, 2003.

[18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.

[19] E. Treister and I. Yavneh. A multilevel iterated-shrinkage approach to ℓ1 penalized least-squares minimization. IEEE Transactions on Signal Processing, 60(12):6319-6329, 2012.

[20] P. Tseng and S. Yun. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1-2):387-423, 2009.

[21] Z. Wen, W. Yin, D. Goldfarb, and Y. Zhang. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization and continuation.
SIAM Journal on Scientific Computing, 32(4):1832-1857, 2010.

[22] M. Zibulevsky and M. Elad. ℓ1-ℓ2 optimization in signal and image processing. IEEE Signal Processing Magazine, 27(3):76-88, May 2010.