Factorized solution of generalized stable Sylvester equations using many-core GPU accelerators

We investigate the factorized solution of generalized stable Sylvester equations such as those arising in model reduction, image restoration, and observer design. Our algorithms, based on the matrix sign function, take advantage of the current trend to integrate high-performance graphics accelerators (also known as GPUs) into computer systems. As a result, our realizations provide a valuable tool for solving large-scale problems on a variety of platforms.

Consider the generalized Sylvester equation

A X D + E X B + F G = 0,    (1)

where A, E ∈ ℝ^{n×n}, B, D ∈ ℝ^{m×m}, F ∈ ℝ^{n×p}, G ∈ ℝ^{p×m}, and X ∈ ℝ^{n×m} is the sought-after solution. Equation (1) has a unique solution if and only if α + β ≠ 0 for all α ∈ Λ(A, E) and β ∈ Λ(B, D), where Λ(U, V) denotes the generalized eigenspectrum of the matrix pencil U − λV. In particular, this property holds for generalized stable Sylvester equations, where both Λ(A, E) and Λ(B, D) lie in the open left half plane. Sylvester equations have numerous applications in control theory, signal processing, filtering, model reduction, image restoration, etc.; see, e.g., [1] and the references therein. In particular, in model reduction using cross-Gramians [2][3][4][5], image restoration [6], and observer design [7], p ≪ n, m and the solution X often exhibits a low (numerical) rank [8]. In such cases, it is beneficial to compute the factorized solution of the equation, from the perspectives of numerical accuracy as well as computational cost. Thus, we aim at computing a pair of matrices Y and Z^T, both with a small number of columns, such that X = Y Z. Algorithms for the standard case (E = I_n, D = I_m) were first suggested in [1,9].
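The unique-solvability condition can be verified numerically through the generalized eigenvalues of the two pencils. In the following sketch the stable pencils are random stand-ins built with a shift-and-QR construction; the helper `stable_pencil` and all dimensions are illustrative choices, not data from the paper:

```python
import numpy as np
from scipy.linalg import eigvals

rng = np.random.default_rng(0)
n, m = 5, 4

def stable_pencil(k):
    # random matrix shifted so that all eigenvalues lie in the open left half plane
    M = rng.uniform(-1.0, 1.0, (k, k))
    M -= np.linalg.norm(M) * np.eye(k)
    Q, R = np.linalg.qr(M)          # M = Q R, so Lambda(R, Q^T) = Lambda(M)
    return R, Q.T

A, E = stable_pencil(n)
B, D = stable_pencil(m)

alphas = eigvals(A, E)              # Lambda(A, E)
betas = eigvals(B, D)               # Lambda(B, D)
# stability: both spectra lie in the open left half plane ...
assert alphas.real.max() < 0 and betas.real.max() < 0
# ... hence alpha + beta != 0 and Eq. (1) has a unique solution
min_sum = min(abs(a + b) for a in alphas for b in betas)
assert min_sum > 0
```

Since both spectra have negative real parts, every sum α + β has negative real part as well, which is why stability implies unique solvability.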
The factorized solution of Eq. (1) involves a considerable computational effort (in terms of arithmetic operations), calling for the application of high-performance techniques when large problems are to be solved. To address this issue, and following the general adoption of graphics processing units (GPUs) as powerful and ubiquitous parallel co-processors, in this work we propose efficient realizations based on the matrix sign function that are specifically designed to exploit this type of accelerator. Our experimental results confirm the efficacy of the sign function-based solvers and report a considerable performance advantage for this type of GPU-based solver.
Several previous works on parallel algorithms for solving linear matrix equations proposed implementations of an algorithm based on the (Hessenberg-)Schur decomposition of the coefficient matrices. As this is usually done using (Sca)LAPACK routines, the work in [10][11][12] concentrates on the solution of (quasi-)triangular Sylvester equations. The work in [12] focuses on task-parallelism, but not on GPU accelerators, and does not consider the generalized variant of the Sylvester equation targeted here; furthermore, [10,11] study the symmetric case of Lyapunov equations. The solution process in all three papers is not related to the iterative low-rank solver approach proposed in our paper, and the parallel performance of these solvers is limited by that of the QZ algorithm in the initial stage of the solution procedure. In comparison, we avoid the QZ algorithm completely and adhere to a matrix-multiplication-rich method that leverages the low-rank structure of the right-hand side for memory and computational savings. The paper [13] discusses an iterative scheme that treats the standard case of (1) without a factorized right-hand side but, judging from the timings and accuracy results listed in that paper, it is not competitive with our approach. Our previous work on parallel and GPU-accelerated Lyapunov and Sylvester solvers, summarized in [14] and further developed in [15], did not consider the generalized and factorized Sylvester case (1). As an additional contribution, we improve our GPU-enabled routine of the factorized solver and include two more variants, a hybrid CPU-GPU version and a dual-GPU version.
The rest of the paper is structured as follows. In Sect. 2, we briefly review the classical sign function solver for the generalized Sylvester equation, discuss the factored iteration, and propose an initial transformation of the equation that considerably reduces the cost per iteration. Then, in Sect. 3, we provide some details on how the Sylvester equation solvers are parallelized using many-core strategies. Numerical experiments reporting the accuracy and the high performance of the new methods on a hardware platform equipped with GPUs are presented in Sect. 4. The final section summarizes the findings in this paper and gives some concluding remarks.

Iterative schemes for generalized stable Sylvester equations
Traditional solvers for generalized Sylvester equations consist of generalizations of the Bartels-Stewart (BS) method [16] and the Hessenberg-Schur method [17,18]. A different approach is to rely on iterative schemes for the computation of the matrix sign function. We adapt the basic Newton iteration used in this context [19] to solve the generalized formulations shown in Eq. (1) and obtain the solution in factored form. Similar algorithms have been proposed for the standard Sylvester equation in [1] and for the Lyapunov equation in [20,21]. We also propose an initial transformation of the equation that further reduces the cost of both the classical and the factored iterations.

Theoretical background
Consider a matrix M ∈ ℝ^{l×l} with no eigenvalues on the imaginary axis, and let

M = S diag(J^−, J^+) S^{−1}

be its Jordan decomposition. The Jordan blocks in J^− ∈ ℝ^{t×t} and J^+ ∈ ℝ^{(l−t)×(l−t)} contain, respectively, the stable and unstable parts of Λ(M, I_l).
Under the given assumptions, the Newton iteration

M_0 := M,  M_{k+1} := (M_k + M_k^{−1})/2,  k = 0, 1, 2, …,    (2)

produces a sequence {M_k}_{k=0}^{∞} that converges to sign(M) = lim_{k→∞} M_k [19], with an ultimately quadratic convergence rate. As the initial convergence may be slow, the use of acceleration techniques such as those in [22,23] is recommended. If X is a solution of Eq. (1), it defines a similarity transformation T that block-diagonalizes the matrix H̃ in (3). Using sign(H̃), the relation given in (4), and the property of the sign function sign(T^{−1} H̃ T) = T^{−1} sign(H̃) T, we derive Eq. (5) for the solution of Eq. (1).
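Iteration (2) admits a direct realization. The following minimal NumPy sketch (with a small random stable matrix as an illustrative input, for which sign(M) = −I) checks the defining property sign(M)^2 = I:

```python
import numpy as np

def matrix_sign(M, tol=1e-12, maxit=60):
    """Newton iteration M_{k+1} = (M_k + M_k^{-1}) / 2 for sign(M)."""
    Mk = M.copy()
    for _ in range(maxit):
        Mk1 = 0.5 * (Mk + np.linalg.inv(Mk))
        if np.linalg.norm(Mk1 - Mk) <= tol * np.linalg.norm(Mk1):
            return Mk1
        Mk = Mk1
    return Mk

rng = np.random.default_rng(1)
l = 6
M0 = rng.uniform(-1.0, 1.0, (l, l))
M = M0 - np.linalg.norm(M0) * np.eye(l)   # all eigenvalues in the left half plane

S = matrix_sign(M)
assert np.allclose(S @ S, np.eye(l), atol=1e-8)  # sign(M)^2 = I
assert np.allclose(S, -np.eye(l), atol=1e-8)     # stable M  =>  sign(M) = -I
```

The relative-difference stopping test used here is a simple stand-in for the more refined criteria discussed later in the paper.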
This relation forms the basis of the numerical algorithm derived next, since it states that we can solve (1) by computing the matrix sign function of H̃ in (3). To this end, we apply the Newton iteration (2) to H̃. Denoting H̃_0 := H̃ = L H M and repeating the calculation performed for the first step, we arrive at H̃_k = L H_k M for all k. Finally, taking limits on both sides yields the expressions for the limits of the iterates.

Solution of the generalized Sylvester equation
In [1] it is observed that, exploiting the block-triangular structure of the matrix pencil H − λK, we obtain the following classical generalized Newton iteration for the solution of the generalized Sylvester equation (1):

A_{k+1} := (A_k + E A_k^{−1} E)/2,  A_0 := A,
B_{k+1} := (B_k + D B_k^{−1} D)/2,  B_0 := B,    k = 0, 1, 2, …,
C_{k+1} := (C_k + E A_k^{−1} C_k B_k^{−1} D)/2,  C_0 := F G.

At convergence, the solution of (1) is computed by solving the linear equation E X D = (1/2) lim_{k→∞} C_k. Also, from (8) we have lim_{k→∞} A_k = −E and lim_{k→∞} B_k = −D, which suggests the stopping criterion

max{ ‖A_k + E‖_F / ‖E‖_F , ‖B_k + D‖_F / ‖D‖_F } ≤ τ,

where τ is a tolerance threshold. One might choose τ = γε for the machine precision ε and, for instance, γ = n or γ = 10√n. However, as the terminal accuracy sometimes cannot be reached, in order to avoid stagnation it is better to choose τ = √ε and to perform 1 to 3 additional iterations once this criterion is satisfied. Due to the quadratic convergence of the Newton iteration (2), this is usually sufficient to reach the attainable accuracy, as already suggested and explained in the context of sign-function-based matrix equation solvers in [20].
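One step of this coupled iteration amounts to one matrix inversion per coefficient pair plus a few matrix products. The NumPy sketch below illustrates the classical iteration on a small random stable problem, with the equation in the form A X D + E X B + F G = 0 and the solution recovered from E X D = lim C_k / 2; the helper `stable_pencil`, the dimensions, and the tolerances are illustrative choices, not taken from the paper's benchmarks:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 6, 5, 2

def stable_pencil(k):
    """Random pencil (A, E) with Lambda(A, E) in the open left half plane."""
    M = rng.uniform(-1.0, 1.0, (k, k))
    M -= np.linalg.norm(M) * np.eye(k)   # shift the spectrum left
    Q, R = np.linalg.qr(M)
    return R, Q.T                        # Lambda(R, Q^T) = Lambda(M)

A, E = stable_pencil(n)
B, D = stable_pencil(m)
F = rng.uniform(-1.0, 1.0, (n, p))
G = rng.uniform(-1.0, 1.0, (p, m))

def ggsylv_sign(A, E, B, D, F, G, tol=1e-12, maxit=60):
    """Classical sign-function iteration for A X D + E X B + F G = 0."""
    Ak, Bk, Ck = A.copy(), B.copy(), F @ G
    for _ in range(maxit):
        EAinv = E @ np.linalg.inv(Ak)    # E A_k^{-1}
        BinvD = np.linalg.inv(Bk) @ D    # B_k^{-1} D
        Ak, Bk, Ck = (0.5 * (Ak + EAinv @ E),
                      0.5 * (Bk + D @ BinvD),
                      0.5 * (Ck + EAinv @ Ck @ BinvD))
        if (np.linalg.norm(Ak + E) <= tol * np.linalg.norm(E)
                and np.linalg.norm(Bk + D) <= tol * np.linalg.norm(D)):
            break
    # at convergence, X solves E X D = lim C_k / 2
    return np.linalg.solve(E, 0.5 * Ck) @ np.linalg.inv(D)

X = ggsylv_sign(A, E, B, D, F, G)
res = np.linalg.norm(A @ X @ D + E @ X @ B + F @ G) / np.linalg.norm(F @ G)
assert res < 1e-8
```

In a production code the explicit inverses would be replaced by an LU factorization and triangular solves, as the paper's GPU implementation does.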

Factored solution of the generalized Sylvester equation
In order to obtain a factorized solution of (1), we rewrite the iteration for C_k as two separate iterations for the factors:

F_{k+1} := (1/√2) [F_k, E A_k^{−1} F_k],
G_{k+1} := (1/√2) [G_k; G_k B_k^{−1} D],  k = 0, 1, 2, …,

with F_0 := F and G_0 := G, so that C_k = F_k G_k for all k. Although this iteration is cheaper during the initial steps if p ≪ n, m, this advantage is lost as the iteration advances, since the number of columns in F_{k+1} and the number of rows in G_{k+1} doubles in each iteration step. This can be avoided by applying a technique similar to that employed in [20] for the factorized solution of generalized Lyapunov equations. Let F_k ∈ ℝ^{n×p_k} and G_k ∈ ℝ^{p_k×m}. We first compute a rank-revealing QR (RRQR) factorization [24] of G_{k+1} as defined above; that is, we calculate

G_{k+1} Π_G = U R,

where U is orthogonal, Π_G is a permutation matrix, and R is upper triangular with a leading block R_1 ∈ ℝ^{r×m} of full row rank. Then, we compute an RRQR factorization of F_{k+1} U, where V is orthogonal, Π_F is a permutation matrix, and T is upper triangular with a leading full-rank block of row dimension t. Replacing F_{k+1} and G_{k+1} by the correspondingly compressed factors, we then get new iterates with only t columns and t rows, respectively, and upon convergence we obtain the solution of (1) in factored form X = Y Z. If X has low numerical rank, the factors Y and Z will have a small number of columns and rows, respectively, and the storage space and computation time needed for the factored iteration will be lower than those of the classical iteration. In such a case, r, t ≪ m, n, and the cost per iteration of the factorized solution is 14/3 n³ + 14/3 m³ flops plus lower-order terms in n, m, r, and t.
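The factored iteration with compression can be sketched compactly; in the following, SciPy's column-pivoted QR stands in for a full RRQR, and the truncation tolerance, test matrices, and helper names are illustrative assumptions. The 1/√2 scaling keeps the product F_{k+1} G_{k+1} equal to the classical C_{k+1} update:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
n, m, p = 6, 5, 2

def stable_pencil(k):
    """Random pencil (A, E) with Lambda(A, E) in the open left half plane."""
    M = rng.uniform(-1.0, 1.0, (k, k))
    M -= np.linalg.norm(M) * np.eye(k)
    Q, R = np.linalg.qr(M)
    return R, Q.T

A, E = stable_pencil(n)
B, D = stable_pencil(m)
F = rng.uniform(-1.0, 1.0, (n, p))
G = rng.uniform(-1.0, 1.0, (p, m))

def compress(Fk, Gk, tol=1e-12):
    """Reduce the inner dimension of Fk @ Gk with two column-pivoted QRs."""
    U, R, P = qr(Gk, mode='economic', pivoting=True)   # Gk[:, P] = U @ R
    d = np.abs(np.diag(R))
    r = max(1, int(np.sum(d > tol * d[0])))            # numerical row rank
    Gc = np.empty((r, Gk.shape[1]))
    Gc[:, P] = R[:r, :]                                # Gk ~= U[:, :r] @ Gc
    Fc = Fk @ U[:, :r]
    V, T, Pf = qr(Fc, mode='economic', pivoting=True)  # Fc[:, Pf] = V @ T
    d = np.abs(np.diag(T))
    t = max(1, int(np.sum(d > tol * d[0])))
    Tc = np.empty((t, r))
    Tc[:, Pf] = T[:t, :]                               # Fc ~= V[:, :t] @ Tc
    return V[:, :t], Tc @ Gc

def ggsylv_sign_factored(A, E, B, D, F, G, tol=1e-12, maxit=60):
    """Factored iteration: returns Y, Z with X = Y @ Z solving
    A X D + E X B + F G = 0."""
    Ak, Bk, Fk, Gk = A.copy(), B.copy(), F.copy(), G.copy()
    s = 1.0 / np.sqrt(2.0)
    for _ in range(maxit):
        EAinv = E @ np.linalg.inv(Ak)
        BinvD = np.linalg.inv(Bk) @ D
        Fk = s * np.hstack([Fk, EAinv @ Fk])   # doubles the columns ...
        Gk = s * np.vstack([Gk, Gk @ BinvD])   # ... and the rows,
        Fk, Gk = compress(Fk, Gk)              # ... which compression undoes
        Ak = 0.5 * (Ak + EAinv @ E)
        Bk = 0.5 * (Bk + D @ BinvD)
        if (np.linalg.norm(Ak + E) <= tol * np.linalg.norm(E)
                and np.linalg.norm(Bk + D) <= tol * np.linalg.norm(D)):
            break
    # X = (1/2) E^{-1} (lim F_k)(lim G_k) D^{-1} = Y @ Z
    return 0.5 * np.linalg.solve(E, Fk), np.linalg.solve(D.T, Gk.T).T

Y, Z = ggsylv_sign_factored(A, E, B, D, F, G)
X = Y @ Z
res = np.linalg.norm(A @ X @ D + E @ X @ B + F @ G) / np.linalg.norm(F @ G)
assert res < 1e-6
```

Note that the inner dimension of the factors stays bounded by the numerical rank of the solution, which is what makes the factored variant attractive for p ≪ n, m.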

Numerical performance
We next analyze the accuracy of the new Sylvester solvers, borrowing examples from [25,26]. We use IEEE double-precision floating-point arithmetic with machine precision ε ≈ 2.2204 × 10⁻¹⁶. For the numerical evaluation, we implemented two MATLAB functions:

ggcsne: the classical generalized Newton iteration for the generalized Sylvester equation, as given in (9).
ggcsnc: the factored variant of the generalized Newton iteration for the generalized Sylvester equation.
We compare these functions with the BS method as implemented in function lyap from the MATLAB Control Toolbox. As the BS solver in MATLAB cannot deal with the generalized Sylvester equation, we apply it to the equivalent standard Sylvester equation obtained by premultiplying (1) with E^{−1} and postmultiplying it with D^{−1}.

Example 1
A basic test case aims to compute the cross-Gramian matrix W_co of a generalized linear time-invariant system. This matrix is given by the solution of a generalized Sylvester equation, with W_co = Ŵ_co M. The cross-Gramian contains information on certain properties of the linear system [4] and can also be used for model reduction [2]. We employ the solvers to compute Ŵ_co for a system described in [25, Example 4.2], which comes from a model for heat control in a thin rod. The physical process is modeled by a linear-quadratic optimal control problem for the instationary 1D heat equation. Semidiscretization in space using finite elements leads to a system of the form (11), where M and K are the mass matrix and stiffness matrix, respectively, of the finite element approximation. Mesh refinement leads to systems of different orders n. The other parameters in this example are set to a = 0.01, b = 2, c = 1, β_1 = 0, β_2 = 0.1, γ_1 = 0.9, γ_2 = 1.
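The transformation applied for the baseline can be reproduced with SciPy's standard Sylvester solver, which handles A X + X B = Q via a Bartels-Stewart-type method; the random matrices below are stand-ins for the heat-equation model, and the helper `stable_pencil` is an illustrative construction:

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(2)
n, m, p = 6, 5, 2

def stable_pencil(k):
    M = rng.uniform(-1.0, 1.0, (k, k))
    M -= np.linalg.norm(M) * np.eye(k)
    Q, R = np.linalg.qr(M)
    return R, Q.T

A, E = stable_pencil(n)
B, D = stable_pencil(m)
F = rng.uniform(-1.0, 1.0, (n, p))
G = rng.uniform(-1.0, 1.0, (p, m))

# transform A X D + E X B + F G = 0 into the standard form
#   (E^{-1} A) X + X (B D^{-1}) = -E^{-1} F G D^{-1}
Ein_A = np.linalg.solve(E, A)
B_Din = np.linalg.solve(D.T, B.T).T
Q_rhs = -np.linalg.solve(E, F @ G) @ np.linalg.inv(D)

X = solve_sylvester(Ein_A, B_Din, Q_rhs)
res = np.linalg.norm(A @ X @ D + E @ X @ B + F @ G) / np.linalg.norm(F @ G)
assert res < 1e-8
```

Multiplying the standard equation by E on the left and D on the right recovers (1), which is the identity the residual check exploits.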
The left-hand plot in Fig. 1 shows the results for various problem dimensions. As a measure of the quality of the solutions, we report the relative residuals, where Ŵ*_co denotes the computed solution. For this example, the two factored solvers outperform the BS method by a margin that grows with the problem dimension n. The right-hand plot in Fig. 1 compares the relative errors in the computed solution X*, ‖X − X*‖_F / ‖X‖_F, for the different methods. The errors for all three algorithms are remarkably similar and as small as could be expected from numerically backward-stable methods.

Many-core versions
In this section we describe our GPU-accelerated realizations of the factored solver. Our routines off-load the most expensive computational stages of the method to the accelerator, leaving less-demanding or hardly parallelizable operations to the CPU.

Single-GPU variant, ggcsnc_gpu. As stated previously, the method proposed for the generalized stable Sylvester equation is based on two simultaneous matrix iterations. From the perspective of computational cost, these iterations involve two major operations: the matrix inversion (computed as a matrix factorization followed by the solution of two triangular linear systems) and a matrix update, which can be performed via a matrix multiplication (gemm). Our experiments show that these two operations (which are performed on both iteration matrices) represent approximately 85% of the total cost of the method. Therefore, an important acceleration can be expected from off-loading them to the GPU. In our implementation, basic linear algebra kernels, such as matrix products, transpositions, and norms, are performed using the cuBLAS GPU-accelerated library, while more complex operations, such as LU factorizations and triangular system solves, rely on the cuSolver library. It should be noted that both operations occur in every iteration of the procedure. Regarding data transfers, the equation matrices A, B, D, and E are sent to the device only once, prior to the beginning of the procedure. In contrast, the factors of the solution matrix are retrieved back to the CPU at each iteration, since the compression stage, which uses the RRQR factorization to reduce the number of columns/rows of the factors, is performed on the CPU. Without compression, the sign function iteration doubles the number of columns/rows of the left/right factors of the solution at each iteration. The compression procedure leverages the low-rank property of the factors to keep their size bounded.
Hybrid variant, ggcsnc_hyb. As our method for the Sylvester equation combines operations performed on the CPU with others performed on the GPU, it is interesting to analyse how the computation on both devices can be overlapped to maximize the utilization of the hardware. To allow this, it is necessary to re-define the computation workflow. Concretely, we overlap the compression stage (performed on the CPU) with the matrix factorizations and triangular solves (computed on the GPU) corresponding to the next iteration, a strategy commonly referred to as look-ahead [27]. This is possible because the results of the linear systems in a given iteration are used to update the iteration matrices, which are then compressed, while the next iteration's factorizations do not depend on the compressed factors. This variant can achieve important accelerations when the compression runtime is comparable to the cost of the matrix factorizations and linear system solution.
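The look-ahead schedule itself can be illustrated independently of the numerical kernels. In this sketch a worker thread stands in for the GPU, the toy `update`/`compress` functions are purely illustrative, and the check confirms that the overlapped schedule produces the same result as the sequential one:

```python
from concurrent.futures import ThreadPoolExecutor

def run_sequential(update, compress, state, factors, iters):
    for _ in range(iters):
        state = update(state)
        factors = compress(factors)
    return state, factors

def run_lookahead(update, compress, state, factors, iters):
    """Overlap compress(k) on the CPU with update for step k+1 (look-ahead)."""
    with ThreadPoolExecutor(max_workers=1) as gpu:
        for _ in range(iters):
            fut = gpu.submit(update, state)   # "GPU": next factorization/solves
            factors = compress(factors)       # CPU: compress current factors
            state = fut.result()              # synchronize before the next step
    return state, factors

# toy kernels: the two dependency chains are independent, so the
# pipelined schedule must give exactly the sequential result
update = lambda s: 2 * s + 1
compress = lambda f: f[: max(1, len(f) // 2)]

seq = run_sequential(update, compress, 1, list(range(16)), 4)
par = run_lookahead(update, compress, 1, list(range(16)), 4)
assert seq == par
```

The real solver follows the same pattern, with the LU factorizations and triangular solves in the role of `update` and the RRQR-based compression in the role of `compress`.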
Dual-GPU variant, ggcsnc_2gpus. The Newton iteration involves two independent recurrences (for {A_k} and {B_k}) plus a third one (for {C_k}) that depends on the other two. However, in the factored variant of the method, the {C_k} recurrence is replaced by two recurrences for {F_k} and {G_k}. In turn, {F_k} depends only on {A_k}, while {G_k} depends only on {B_k}. This makes it possible to completely separate the ({A_k}, {F_k})-iteration from the ({B_k}, {G_k})-iteration, handling each in a separate device. This approach offers two distinct benefits. On the one hand, duplicating the computational power applied to the main stages of the algorithm can strongly reduce the required runtime (up to 2× in the ideal case). On the other hand, the duplicated memory allows addressing problems of larger scale. Although the two recurrences are independent, the compression of the factors requires a synchronization. Specifically, the synchronization occurs before the RRQR factorization of F_{k+1} U, given that U is the orthogonal matrix resulting from the RRQR factorization of G_{k+1}. The communication of data between the CPU memory and the devices also occurs at this point, where F_{k+1} = [F_k, E A_k^{−1} F_k] and G_{k+1} = [G_k; G_k B_k^{−1} D] are transferred to the CPU memory to be compressed there. It is important to note that, if the compression procedure keeps the number of columns of F_k and rows of G_k small, the cost of these transfers and of the compression itself will be small compared with that of the operations involving the square matrices A_k and B_k (of dimensions n and m, respectively); see the outline of the ggcsnc_2gpus variant in Fig. 2.
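Because the two halves share no data until the compression point, the split can be expressed directly as two tasks. In the sketch below two CPU threads stand in for the two GPUs, all matrices are small random stand-ins, and one threaded step is checked against the sequential computation:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def a_step(Ak, E, Fk):
    """One step of the (A_k, F_k) recurrence (would run on device 0)."""
    EAinv = E @ np.linalg.inv(Ak)
    return 0.5 * (Ak + EAinv @ E), np.hstack([Fk, EAinv @ Fk]) / np.sqrt(2.0)

def b_step(Bk, D, Gk):
    """One step of the (B_k, G_k) recurrence (would run on device 1)."""
    BinvD = np.linalg.inv(Bk) @ D
    return 0.5 * (Bk + D @ BinvD), np.vstack([Gk, Gk @ BinvD]) / np.sqrt(2.0)

rng = np.random.default_rng(3)
n, m, p = 5, 4, 2
A = rng.uniform(-1.0, 1.0, (n, n)); A -= np.linalg.norm(A) * np.eye(n)
B = rng.uniform(-1.0, 1.0, (m, m)); B -= np.linalg.norm(B) * np.eye(m)
E, D = np.eye(n), np.eye(m)          # identity pencils, for illustration only
F = rng.uniform(-1.0, 1.0, (n, p))
G = rng.uniform(-1.0, 1.0, (p, m))

# sequential reference
A1, F1 = a_step(A, E, F)
B1, G1 = b_step(B, D, G)

# the two independent halves, one per "device"
with ThreadPoolExecutor(max_workers=2) as devices:
    fa = devices.submit(a_step, A, E, F)
    fb = devices.submit(b_step, B, D, G)
    (A2, F2), (B2, G2) = fa.result(), fb.result()
# synchronization point: F2 and G2 would now be compressed jointly via RRQR

assert np.allclose(A1, A2) and np.allclose(F1, F2)
assert np.allclose(B1, B2) and np.allclose(G1, G2)
```

Only the joint compression of F_{k+1} and G_{k+1} forces the two devices to meet, which matches the synchronization pattern described above.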

Experimental evaluation
In this section we analyse the performance of the generalized Sylvester equation solvers using two platforms for the experiments. The executions involving GPUs were performed on a server equipped with an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, 64GB of RAM, and two GeForce GTX 980 Ti GPUs with 6GB of GDDR5 memory each. The multicore CPU experiments were performed on a server with 2×20-core Intel Xeon Gold 6138 CPUs @ 2.00GHz and 128GB of RAM. IEEE double-precision floating-point arithmetic was used for all the executions, and the multicore runs were performed with MATLAB R2018b on top of a multi-threaded instance of Intel MKL. The GPU executions were performed using CUDA 10.2.

Example 3
We construct matrices Â ∈ ℝ^{n×n}, B̂ ∈ ℝ^{m×m}, F ∈ ℝ^{n×p}, and G ∈ ℝ^{p×m} with random entries uniformly distributed in [−1, 1]. Matrix Â is then "stabilized" as Â := Â − ‖Â‖_F I_n. Finally, we compute the QR factorization Â = Q R and set A := R and E := Q^T, so that Λ(A, E) coincides with the (stable) spectrum of Â; B and D are obtained analogously from B̂.

For the experimental evaluation we include two extra versions as baseline variants. First, a CPU-based variant (ggcsnc) of the iterative solver to compute the factorized solution of the generalized Sylvester equation, implemented in MATLAB. This is the only variant that we execute on the Intel Xeon Gold processor, enabling MATLAB to use 20 CPU threads (the configuration that offers the best performance). Second, a GPU version of the generalized Sylvester solver that works with the full (non-factored) solution (ggcsne_gpu). Table 1 summarizes the runtimes required by the baseline variants as well as the three GPU versions proposed in this work: ggcsnc_gpu, ggcsnc_hyb, and ggcsnc_2gpus. These implementations are employed to solve three scalable test cases. In our evaluation, we select the cases corresponding to the matrix dimensions n = m = 256, 512, 1024, 2048, and 4096.
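This construction can be reproduced in a few lines. The sketch below assumes, as the truncated description suggests, that the QR factors are reassigned as A := R and E := Q^T, which preserves the stabilized spectrum; it then verifies that the resulting pencil is indeed stable:

```python
import numpy as np
from scipy.linalg import eigvals

rng = np.random.default_rng(4)
n = 8
Ahat = rng.uniform(-1.0, 1.0, (n, n))
Ahat -= np.linalg.norm(Ahat) * np.eye(n)   # "stabilized": spectrum shifted left
Q, R = np.linalg.qr(Ahat)
A, E = R, Q.T                              # pencil A - lambda E

lam_hat = eigvals(Ahat)                    # spectrum of Ahat
lam_pen = eigvals(A, E)                    # generalized spectrum of (A, E)
assert lam_hat.real.max() < 0              # Ahat is stable
assert lam_pen.real.max() < 0              # ... and so is the pencil

# the two spectra coincide (each eigenvalue of Ahat appears in Lambda(A, E))
dist = max(float(np.min(np.abs(l - lam_pen))) for l in lam_hat)
assert dist < 1e-8
```

The identity behind the check: R x = λ Q^T x is equivalent to Q R x = λ x, so Λ(R, Q^T) = Λ(Â).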
The results show that the advantage of the ggcsne_gpu baseline over the ggcsnc variant decreases with the problem size. This indicates that the factored solution offers significant benefits for large instances of the evaluated examples, which is confirmed when comparing the ggcsnc_gpu and ggcsne_gpu variants. For the smallest test cases, with n = m = 256, the acceleration obtained by ggcsnc_gpu is mild, and it grows consistently with the size of the problem in all three examples. According to the results, the ggcsnc_hyb variant does not show any significant performance difference with respect to ggcsnc_gpu for Examples 2 and 3, while it only presents a modest advantage for dimension 2048.

Table 1: Execution time (in milliseconds) for the baseline and proposed variants, on the three example problems and different problem sizes.

To analyze this behavior, we consider two main stages: the inversion and update of the iteration matrices, and the compression of the solution matrix. As stated previously, the matrix inversion and update are performed on two matrices independently and represent the most computationally demanding parts. The ggcsnc_2gpus variant leverages task-parallelism to off-load each inversion-update sequence to a different GPU, which makes the maximum achievable runtime reduction equivalent to half the total runtime of these stages. In comparison, in the ggcsnc_hyb version these stages are computed concurrently with the matrix compression, which implies that the best improvement in this case is equal to the runtime taken by the compression (as this operation typically requires significantly less runtime than the other overlapped operations). In Table 2 (right) we summarize the theoretical and actual runtime reductions. For the ggcsnc_hyb solver, the achieved runtime reductions closely match the theoretical values.
It should be remarked that the computational cost of the compression stage depends on the number of rows and columns of the matrices involved. In the three examples of Table 2, these numbers are 36, 53, and 7, respectively, which means that this variant only offers benefits when n and m are large. Regarding the ggcsnc_2gpus variant, the results show that the observed runtime reductions are close to the theoretical values. Specifically, these reductions exceed 80% of the peak for the examples of Table 2, and the benefits increase with the runtime.

Conclusions
We have discussed a matrix sign function-based scheme to directly obtain the factorized solution of the generalized stable Sylvester equation. The factored iteration (in conjunction with rank-revealing QR factorizations) allows significant savings in computation time and memory requirements when the solution has low numerical rank. The novel algorithm can be efficiently parallelized. In this work, we have designed and evaluated implementations that efficiently leverage both data- and task-parallelism on platforms equipped with multicore processors and one or two GPUs. The experimental results confirm the efficacy of the sign function-based solvers and report a considerable advantage that can be realized on massively parallel hardware.
In future work, we intend to generalize our approach to harness distributed platforms equipped with several GPUs, in order to handle large-scale problems.