
EFFICIENT PARALLEL NON-NEGATIVE LEAST SQUARES ON

MULTI-CORE ARCHITECTURES
YUANCHENG LUO AND RAMANI DURAISWAMI
UNIVERSITY OF MARYLAND, COLLEGE PARK

Abstract. We parallelize a version of the active-set iterative algorithm derived from the original
works of Lawson and Hanson (1974) on multi-core architectures. This algorithm requires the solution
of an unconstrained least squares problem in every step of the iteration for a matrix composed
of the passive columns of the original system matrix. To achieve improved performance, we use
parallelizable procedures to efficiently update and downdate the QR factorization of the matrix
at each iteration, to account for inserted and removed columns. We use a reordering strategy of
the columns in the decomposition to reduce computation and memory access costs. We consider
graphics processing units (GPUs) as a new mode for efficient parallel computations and compare our
implementations to that of multi-core CPUs. Both synthetic and non-synthetic data are used in the
experiments.

Key words. Non-negative least squares, active-set, QR updating, parallelism, multi-core, GPU,
deconvolution

AMS subject classifications. 15A06, 15A23, 65Y05, 65Y20

1. Introduction. A central problem in data-modelling is the optimization of
underlying parameters specifying a linear model used to describe observed data. The
underlying parameters of the model form a set of n variables in an n × 1 vector x =
{x_1, · · · , x_n}^T. The observed data is composed of m observations in an m × 1 vector b =
{b_1, · · · , b_m}^T. Suppose that the observed data are linear functions of the underlying
parameters in the model; then the function values at the data points may be expressed
through an m × n matrix A, where Ax = b describes a linear mapping from the parameters
in x to the observations in b.
In the general case where m ≥ n, the dense overdetermined system of linear
equations may be solved via a least squares approach. The usual way to solve the
least squares problem is with the QR decomposition of the matrix A, where A = QR,
with Q an orthogonal m × n matrix and R an upper-triangular n × n matrix. Modern
implementations for general matrices use successive applications of the Householder
transform to form QR, though variants based on Givens rotations or Gram-Schmidt
orthogonalization are also viable. Such algorithms carry an associated O(mn^2) time-
complexity. The resulting matrix equation may be rearranged to Rx = Q^T b and
solved via back-substitution for x.
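For concreteness, a minimal sketch of this final triangular solve is shown below (CUDA C++ host code; the routine name and dense row-major storage are illustrative assumptions, not the implementation used later in the paper):

#include <vector>

// Solve R x = Q^T b by back-substitution; R is n x n upper triangular
// (row-major, R[i][j]) and qtb holds Q^T b. Assumes A has full column rank.
std::vector<double> back_substitute(const std::vector<std::vector<double>>& R,
                                    const std::vector<double>& qtb) {
    const int n = (int)qtb.size();
    std::vector<double> x(n, 0.0);
    for (int i = n - 1; i >= 0; --i) {      // last unknown first
        double s = qtb[i];
        for (int j = i + 1; j < n; ++j) s -= R[i][j] * x[j];
        x[i] = s / R[i][i];                 // R[i][i] != 0 under the full-rank assumption
    }
    return x;
}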
Sometimes, the underlying parameters are constrained to be non-negative in order
to reflect real-world prior information. When the data is corrupted by noise, the
estimated parameters may not satisfy these constraints, producing answers which are
not usable. In these cases, it is necessary to explicitly enforce non-negativity, leading
to the non-negative least squares (NNLS) problem considered in this paper.
The seminal work of Lawson and Hanson [19] provides the first widely used method
for solving this non-negative least squares problem. This algorithm, later referred to as
the active-set method, partitions the set of parameters or variables into the active and
passive-sets. The active-set contains the variables that would violate the non-negativity
constraint and whose values are therefore forcibly set to zero. The passive-set contains
the variables that do not violate the constraint. By iteratively updating a feasibility
vector with components from the passive-set, each iteration is reduced to an unconstrained
linear least squares sub-problem that is solvable via QR.

For many signal processing applications, NNLS problems in a few hundred to
a thousand variables arise. In time-delay estimation, for example, multiple systems
are continuously stored or streamed for processing. A parallel method for solving
multiple NNLS problems would enable on-line applications, in which the estimation
can be performed as data is acquired. Motivated by such an application, we develop
an efficient algorithm and its implementations on both multi-core CPUs and modern
GPUs.
Section 1.2 summarizes alternative solutions to the NNLS problem. Section 2 es-
tablishes notation and formally describes the active-set algorithm. Section 3 presents
a new method for updating the QR decompositions for the active-set algorithm. Sec-
tions 4-5 describe parallelism on multi-core CPUs and GPU-like architectures. Section
6 provides a motivating application from remote estimation and section 7 compares
the GPU and CPU results from experiments.
1.1. Non-negative Least Squares. We formally state the NNLS problem:
Given an m × n matrix A ∈ R^{m×n}, find a non-negative n × 1 vector x ∈ R^n that
minimizes the functional f(x) = (1/2)∥Ax − b∥^2, i.e.

(1.1)  min_x f(x) = (1/2)∥Ax − b∥^2,  x_i ≥ 0.

The Karush-Kuhn-Tucker (KKT) conditions necessary for an optimal constrained
solution to an objective function f(x) can be stated as follows [18]: Suppose x̂ ∈ R^n
is a local minimum subject to inequality constraints g_j(x) ≤ 0 and equality constraints
h_k(x) = 0; then there exist vectors μ, λ such that

(1.2)  ∇f(x̂) + λ^T ∇h(x̂) + μ^T ∇g(x̂) = 0,  μ ≥ 0,  μ^T g(x̂) = 0.

In order to apply the KKT conditions to the minimization problem (1.1), let ∇f(x) =
A^T(Ax − b), g_j(x) = −x_j, and h_k(x) = 0. This leads to the necessary conditions

(1.3)  μ = ∇f(x̂),  ∇f(x̂)^T x̂ = 0,  ∇f(x̂) ≥ 0,  x̂ ≥ 0

that must be satisfied at the optimal solution.
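The reduction from (1.2) to (1.3) follows directly once the constraint gradients are substituted. Since ∇g_j(x) = −e_j and there are no active equality constraints, (1.2) gives

∇f(x̂) − μ = 0  ⇒  μ = ∇f(x̂),        μ ≥ 0  ⇒  ∇f(x̂) ≥ 0,
μ^T g(x̂) = −μ^T x̂ = 0  ⇒  ∇f(x̂)^T x̂ = 0,

and primal feasibility supplies x̂ ≥ 0.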


1.2. Survey of NNLS Algorithms. A comprehensive review of the methods
for solving the NNLS problem can be found in [9]. The first widely used algorithm,
proposed by Lawson and Hanson in [19], is the active-set method that we implement on
the GPU. Although our survey indicates that many newer methods have since surpassed
the active-set method for large and sparse matrix systems, the active-set method remains
competitive for small to moderately sized systems with unstructured, dense matrices.
In [6], improvements to the original active-set method are developed for the Fast
NNLS (FNNLS) variant. By reformulating the normal equations that appear in the
pseudo-inverse for the least squares sub-problem, the cross-products A^T A
and A^T b can be pre-computed. This contribution leads to significant speed-ups in
the presence of multiple right-hand sides. In [2], further redundant computations are
avoided by grouping similar right-hand-side observations that would lead to similar
pseudo-inverses.
A second class of algorithms is iterative optimization methods. Unlike the active-
set approach, these methods are not limited to a single active constraint at each itera-
tion. In [17], a Projective Quasi-Newton NNLS approach uses gradient projections to
avoid pre-computing AT A and non-diagonal gradient scaling to improve convergence
and reduce zigzagging. Another approach, in [10], produces a sequence of vectors, each
optimized over a single coordinate with all other coordinates fixed. These vectors have
efficiently computable analytical solutions and converge to the solution of the problem.
Other methods outside the scope of this review include the Principal Block Pivot-
ing method for large sparse NNLS in [7], and the Interior Point Newton-like method
in [1], [16] for moderate and large problems.
2. Active-set Method. Given a set of m linear equations in n unknowns which
are constrained to be non-negative, let the active-set Z be the subset of variables which
violate the non-negativity constraint or are zero and the passive-set P be the variables
with positive values. Lawson and Hanson observe that only a small subset of variables
remains in the candidate active-set Z at the solution. If the true active-set Z is known,
then the NNLS problem is solved by an unconstrained least squares problem using
the variables from the passive-set.

Algorithm 1 Active-set method for non-negative least squares [19]


Require: A ∈ R^{m×n}, x = 0 ∈ R^n, b ∈ R^m, set Z = {1, 2, . . . , n}, P = ∅
Ensure: Solution x̂ ≥ 0 s.t. x̂ = arg min (1/2)∥Ax − b∥^2
1: while true do
2: Compute negative gradient w = AT (b − Ax)
3: if Z ̸= ∅ and maxi∈Z (wi ) > 0 then
4: Let j = arg maxi∈Z (wi )
5: Move j from set Z to P
6: while true do
7: Let matrix AP ∈ Rm×∗ s.t. AP = {columns Ai s.t. i ∈ P }
8: Compute least squares solution y for AP y = b
9: if min(yi ) ≤ 0 then
10: Let α = min i∈P, yj ≤0 ( xi / (xi − yj) ) s.t. (column j ∈ AP ) = (column i ∈ A)
11: Update feasibility vector x = x + α(y − x)
12: Move from P to Z, all i ∈ P s.t. xi = 0
13: else
14: Update x = y
15: break
16: end if
17: end while
18: else
19: return x
20: end if
21: end while

In algorithm 1, the candidate active-set Z is updated by first moving the variable
with the largest positive component in the negative gradient w to the passive-set (line 5).
This selects the component whose gradient is most negative, i.e., the one whose inclusion
most reduces the residual 2-norm. The variables in the passive-set form a candidate linear
least squares system A^P y = b, where matrix A^P contains the column vectors of matrix A
that correspond to indices in the passive-set (lines 7, 8). At each iteration, the feasibility
vector x moves towards the solution vector y while preserving non-negativity (line 11).
Convergence to the optimal solution is proven in [19].
The termination condition (line 3) checks whether the negative gradient still has a
strictly positive component over Z, i.e., whether the residual can still be reduced. At
termination, the following relations satisfy the optimality conditions in (1.3):
1. w_i ≤ 0 for i ∈ Z (termination condition, line 3).
2. w_i = 0 for i ∈ P (solution of the least squares sub-problem, line 8).
3. x_i = 0 for i ∈ Z (updating the sets, line 12).
4. x_i > 0 for i ∈ P (updating x, lines 10-11).
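For reference, a minimal host-side sketch of Algorithm 1 is given below (CUDA C++; names and the tolerance are illustrative). The helper solve_ls is a hypothetical stand-in that solves the unconstrained sub-problem by normal equations; the remainder of the paper replaces it with an updated QR factorization of A^P.

#include <algorithm>
#include <vector>

using Vec  = std::vector<double>;
using Cols = std::vector<Vec>;                       // column-major: A[j] is column j, length m

// Unconstrained least squares on the passive columns via normal equations and
// naive Gaussian elimination; a stand-in only, not the paper's QR path.
static Vec solve_ls(const Cols& A, const std::vector<int>& P, const Vec& b) {
    const int k = (int)P.size(), m = (int)b.size();
    std::vector<Vec> N(k, Vec(k + 1, 0.0));          // augmented system [A_P^T A_P | A_P^T b]
    for (int r = 0; r < k; ++r) {
        for (int c = 0; c < k; ++c)
            for (int t = 0; t < m; ++t) N[r][c] += A[P[r]][t] * A[P[c]][t];
        for (int t = 0; t < m; ++t) N[r][k] += A[P[r]][t] * b[t];
    }
    for (int p = 0; p < k; ++p)                      // elimination (no pivoting)
        for (int r = p + 1; r < k; ++r) {
            const double f = N[r][p] / N[p][p];
            for (int c = p; c <= k; ++c) N[r][c] -= f * N[p][c];
        }
    Vec y(k);
    for (int r = k - 1; r >= 0; --r) {               // back-substitution
        double s = N[r][k];
        for (int c = r + 1; c < k; ++c) s -= N[r][c] * y[c];
        y[r] = s / N[r][r];
    }
    return y;
}

// Active-set NNLS (Algorithm 1); degenerate ties are not handled specially.
Vec nnls(const Cols& A, const Vec& b, double tol = 1e-10) {
    const int n = (int)A.size(), m = (int)b.size();
    Vec x(n, 0.0);
    std::vector<int> P;                              // passive-set, in insertion order
    std::vector<bool> inP(n, false);
    while (true) {
        Vec r(b);                                    // residual b - A x
        for (int j = 0; j < n; ++j)
            for (int t = 0; t < m; ++t) r[t] -= A[j][t] * x[j];
        int jmax = -1; double wmax = tol;            // w = A^T r over the active-set Z
        for (int j = 0; j < n; ++j) if (!inP[j]) {
            double wj = 0.0;
            for (int t = 0; t < m; ++t) wj += A[j][t] * r[t];
            if (wj > wmax) { wmax = wj; jmax = j; }
        }
        if (jmax < 0) return x;                      // Z empty or max_i w_i <= 0
        P.push_back(jmax); inP[jmax] = true;         // move j from Z to P
        while (true) {
            Vec y = solve_ls(A, P, b);
            if (*std::min_element(y.begin(), y.end()) > 0.0) {
                for (size_t q = 0; q < P.size(); ++q) x[P[q]] = y[q];
                break;                               // feasible: accept y and leave
            }
            double alpha = 1.0;                      // largest step keeping x non-negative
            for (size_t q = 0; q < P.size(); ++q)
                if (y[q] <= 0.0 && x[P[q]] - y[q] > 0.0)
                    alpha = std::min(alpha, x[P[q]] / (x[P[q]] - y[q]));
            for (size_t q = 0; q < P.size(); ++q)
                x[P[q]] += alpha * (y[q] - x[P[q]]);
            std::vector<int> keep;                   // move zeroed variables back to Z
            for (int idx : P)
                if (x[idx] > tol) keep.push_back(idx);
                else { x[idx] = 0.0; inP[idx] = false; }
            P.swap(keep);
            if (P.empty()) break;                    // numerical safeguard
        }
    }
}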
The variables in the passive-set form the corresponding columns of the matrix AP
in the unconstrained least squares sub-problem AP y = b. As discussed previously,
the cost of solving the unconstrained least squares sub-problem is O(mn^2) via QR.
If there are k iterations, then the cost of k independent decompositions is O(kmn^2).
However, the decompositions at each iteration share a similar structure in matrix AP ,
and this can be taken advantage of. We observe the following properties of matrix
AP as the iterations proceed:
1. The active and passive-sets generally exchange a single variable per iteration;
one column is added or removed from matrix AP .
2. Most exchanges move variables from the active-set into the passive-set; early
iterations add variables to an empty passive-set to build the feasible solution,
while later iterations add and remove variables to refine the solution.
Hence, we develop a general method for QR column updating and downdating that
takes advantage of the pattern of movement between variables in the active and
passive-sets. To achieve real-time and on-line processing, the method must be paral-
lelizable on GPUs or other multi-core architectures. We note that the improvements
made to the active-set NNLS proposed in [6], [2] do not apply to our problem, and
moreover do not account for possible efficiencies suggested by the observations above.

3. Proposed Algorithm. The first property of matrix A^P suggests that a full
A^P = QR decomposition is unnecessary. Instead, we consider an efficient QR column
updating and downdating method.
1. QR Updating: A new variable added to set P expands matrix AP by a single
column. Update previous matrices Q, R with this column insertion.
2. QR Downdating: The removal of a variable from set P shrinks matrix AP by
a single column. Downdate previous matrices Q, R with this column deletion.
The second property of matrix AP suggests that we can optimize the cost for QR
updating in terms of floating point operations (flops) and column or row memory
accesses. We observe that many QR updating methods minimize computations when
inserting columns at the right-most index. Our method takes advantage of this by
maintaining a separate ordering for the columns of matrix AP by the relative times of
insertions and deletions across iterations. That is, a column insertion always appends
to the end of a reordered matrix AˆP . We describe the effects of the reordering strategy
for various updating methods in sections 3.1-3.3. We also show that the modified
Gram-Schmidt and Givens rotation methods are the most cost efficient with respect
to the reordering strategy for overdetermined and square systems.

3.1. QR Updating by Modified Gram-Schmidt. The reordering strategy
allows a new column a_i from the matrix A^P = [a_1, · · · , a_i, · · · , a_n] to be treated as
the right-most column in the decomposition. We define list P̂ as an ordered list of
column indices from set P such that the column associated with p̂_{i−1} is added in an
iteration prior to that of column p̂_i. The reordered decomposition Â^P = Q̂R̂ is

(3.1)  Â^P = [a_{p̂_1}, · · · , a_{p̂_{i−1}}, a_{p̂_i}],
       Q̂ = [q_{p̂_1}, · · · , q_{p̂_{i−1}}, q_{p̂_i}],
       R̂ = [r_{p̂_1}, · · · , r_{p̂_{i−1}}, r_{p̂_i}],

where Q̂ is an m × i matrix and R̂ is an i × i matrix. To compute column q_{p̂_i}, we
orthogonalize the inserted column a_i against all the previous columns in matrix Q̂ via
vector projections. To compute column r_{p̂_i}, we take the inner products between
column a_i and the columns of Q̂, or the equivalent matrix-vector product Q̂^T a_i. Both
quantities are found using the Modified Gram-Schmidt (MGS) procedure in algorithm
2.

Algorithm 2 Reordered MGS QR Column Updating


Require: Reordered list P̂ contains the elements in set P , index i the variable added
to set P , column ai the new column in AP , columns qj ∈ Q
Ensure: AˆP = Q̂R̂, update vector Q̂T b, list P̂
1: Let vector u = ai
2: for all column index k ∈ list P̂ do
3: u = u − ⟨qk , u⟩qk
4: R̂ki = ⟨ai , qk ⟩
5: end for
6: qi = u / ∥u∥
7: R̂ii = ∥u∥
8: Q̂T bi = ⟨qi , b⟩
9: Add i to list P̂

With the reordering strategy in algorithm 2, a new column a_i is always inserted
in the right-most position of matrix Â^P. The number of columns read from memory
in matrix Q̂ is the size of set P, denoted ℓ ≤ n, and these columns are used to form
column q_i. The number of column memory writes per step is two, as column q_i is
appended to matrix Q̂ and the projection step writes a single column to matrix R̂.
Updating matrices Q̂ and R̂ requires 6mℓ + 3m + 1 flops. The asymptotic complexity
is O(mn).
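A serial sketch of this update step is shown below (CUDA C++ host code, column-major storage; names are illustrative). It follows the listing in algorithm 2, with the R̂ entries computed against the original column a_i as written there; the GPU version parallelizes each inner product and vector update across the m threads of a block (section 5.2).

#include <cmath>
#include <vector>

// Append column a (length m) to the reordered factorization. Q holds the ell
// existing orthonormal columns; rcol receives the new column of R̂ (length
// ell + 1, last entry is the new diagonal); qtb_new receives (Q̂^T b)_i.
std::vector<double> mgs_update(const std::vector<std::vector<double>>& Q,
                               const std::vector<double>& a,
                               const std::vector<double>& b,
                               std::vector<double>& rcol,
                               double& qtb_new) {
    const size_t m = a.size(), ell = Q.size();
    std::vector<double> u = a;                       // u starts as the inserted column
    rcol.assign(ell + 1, 0.0);
    for (size_t k = 0; k < ell; ++k) {
        double proj = 0.0, rk = 0.0;
        for (size_t t = 0; t < m; ++t) { proj += Q[k][t] * u[t]; rk += Q[k][t] * a[t]; }
        for (size_t t = 0; t < m; ++t) u[t] -= proj * Q[k][t];   // u = u - <q_k, u> q_k
        rcol[k] = rk;                                            // R̂_{k,i} = <a_i, q_k>
    }
    double nrm = 0.0;
    for (double v : u) nrm += v * v;
    nrm = std::sqrt(nrm);                            // assumes a_i is not in span(Q̂)
    for (double& v : u) v /= nrm;                    // q_i = u / ||u||
    rcol[ell] = nrm;                                 // R̂_{i,i} = ||u||
    qtb_new = 0.0;
    for (size_t t = 0; t < m; ++t) qtb_new += u[t] * b[t];       // (Q̂^T b)_i = <q_i, b>
    return u;                                        // the new column q_i
}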
Without the reordering strategy, column a_i can be inserted into the middle of
matrix Â^P. This requires computing column q_i and then re-orthogonalizing the ℓ − i
columns to its right. The memory access cost of computing q_i is i column reads from
matrix Q̂ and two column writes to matrices Q̂ and R̂. The re-orthogonalization cost
of a column q_j where j > i is equivalent to a new column insertion into matrix A^P.
This is because the MGS method does not compute the null-space of the basis vectors
in matrix Q̂: orthogonalizing columns q_j and q_{j+1} with respect to column q_i does not
preserve the orthogonality between q_j and q_{j+1}. Thus, each of the ℓ − i + 1 columns
must be reinserted, with an additional ℓ(ℓ − i + 1) column reads and 2(ℓ − i + 1) column
writes. Updating matrices R̂ and Q̂ requires a total of (3mℓ + 3mi + 1)(ℓ − i + 2) flops.
The asymptotic complexity is O(mn^2).
3.2. Alternative QR Updating by Rotations. Rotation-based methods for
updating QR are possible. In [12], Q̂ and R̂ are treated as m × m matrices, where
matrix Q̂ is initially the identity. When inserting column a_i, the method appends the
m × 1 column vector r_{p̂_i} = Q̂^T a_i to matrix R̂. A series of rotation transformations
introduces zeros in rows {i + 1, i + 2, · · · , m} of column r_{p̂_i} to preserve the upper-
triangular property. The rotation transformations then update the columns to the
right of index i in matrices R̂ and Q̂. A similar step follows for updating the right-hand
side (Q̂^T b)_i.
Without the reordering strategy, the costs of this rotation method depend on
index i. Column r_{p̂_i} requires m − i rotation transformations. Each transformation
requires two row memory reads and writes to matrix R̂ and two column memory
reads and writes to matrix Q̂, for a total of 2(m − i). This is disadvantageous as the
number of column and row accesses is bounded by m and multiple columns and rows
of matrices R̂ and Q̂ are modified. Updating matrices Q̂ and R̂ requires 6m(m − i)
and 2m^2 + 8(m − i) + 6(ℓ − i + 1)(ℓ/2 − i/2 − 1) flops respectively. The asymptotic
complexity is O(m^2 + n^2).
With the reordering strategy, index i = ℓ + 1 and so many of the costs are reduced.
There are no columns to the right of index i in matrix R̂, so updating is limited to a
single column memory write of column r_{p̂_i}. Updating matrix Q̂ now requires m − ℓ − 1
column reads and writes each while applying the transformations. Updating matrices
Q̂ and R̂ requires 6m(m − ℓ − 1) and 2m^2 + 8(m − ℓ − 1) flops respectively. The
asymptotic complexity is O(m^2).
3.3. Alternative QR Updating by Semi-normal Equations. The corrected
semi-normal equations (CSNE) can be used to update an ℓ × ℓ matrix R̂ without the
construction of matrix Q̂. The stability analysis of this method is provided by [3].
With the reordering strategy, the sub-problem is treated as R̂^T R̂ x = (Â^P)^T b, where
column r_{p̂_i} is computed by

(3.2)  R̂^T R̂ z = (Â^P)^T a_i,
       s = a_i − Â^P z,
       R̂^T R̂ δz = (Â^P)^T s,
       z = z + δz,
       r_{p̂_i} = [ R̂ z ; ∥Â^P z − a_i∥ ].

Although the method does not compute and store matrix Q̂, it requires both row and
column access to matrix R̂ and more operations to produce column r_{p̂_i}. Computing
and correcting for vector z entails four back-substitutions using matrices R̂^T, R̂, and
Â^P. All four back-substitutions require ℓ row or column memory reads from matrix
R̂ each. Two of the back-substitutions require m row memory reads from matrix
Â^P. The total number of column and row memory reads in the method is 3m + 2ℓ,
plus one column memory write to update matrix R̂. The entire procedure requires
6m^2 + 4m + 1 + 3ℓ^2 + 3ℓ flops. The asymptotic complexity is O(m^2 + n^2).
Without the reordering strategy, the same CSNE method computes column r_{p̂_i},
and a series of rotations introduces zeros below index i. The rotation transformations
are then applied to the columns to the right of index i in matrix R̂. This requires
an additional ℓ − i row memory reads and writes to matrix R̂ each and 6(ℓ − i +
1)(ℓ/2 − i/2 − 1) + 3(ℓ − i) flops. The asymptotic complexity is O(n^2). The costs for
the updates are summarized in Table 3.1.
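A dense host-side sketch of this CSNE update is given below (CUDA C++, column-major storage, illustrative names; it assumes R̂ is nonsingular). Each seminormal solve is a forward substitution with R̂^T followed by a back-substitution with R̂, matching (3.2).

#include <cmath>
#include <vector>

using Vec  = std::vector<double>;
using Cols = std::vector<Vec>;                       // X[j] is column j

static Vec ATx(const Cols& AP, const Vec& v) {       // (A^P)^T v
    Vec w(AP.size(), 0.0);
    for (size_t j = 0; j < AP.size(); ++j)
        for (size_t t = 0; t < v.size(); ++t) w[j] += AP[j][t] * v[t];
    return w;
}

static Vec seminormal_solve(const Cols& R, const Vec& rhs) {   // R̂^T R̂ z = rhs
    const size_t l = rhs.size();
    Vec t(l), z(l);
    for (size_t i = 0; i < l; ++i) {                 // forward: R̂^T t = rhs
        double s = rhs[i];
        for (size_t k = 0; k < i; ++k) s -= R[i][k] * t[k];
        t[i] = s / R[i][i];
    }
    for (size_t i = l; i-- > 0; ) {                  // backward: R̂ z = t
        double s = t[i];
        for (size_t k = i + 1; k < l; ++k) s -= R[k][i] * z[k];
        z[i] = s / R[i][i];
    }
    return z;
}

// Returns the new column of R̂ (length ell + 1): R̂ z on top, residual norm last.
Vec csne_update(const Cols& R, const Cols& AP, const Vec& a) {
    const size_t l = R.size(), m = a.size();
    Vec z = seminormal_solve(R, ATx(AP, a));         // R̂^T R̂ z = (A^P)^T a_i
    Vec s = a;                                       // s = a_i - A^P z
    for (size_t j = 0; j < l; ++j)
        for (size_t t = 0; t < m; ++t) s[t] -= AP[j][t] * z[j];
    Vec dz = seminormal_solve(R, ATx(AP, s));        // correction: R̂^T R̂ δz = (A^P)^T s
    for (size_t j = 0; j < l; ++j) z[j] += dz[j];
    Vec rnew(l + 1, 0.0);
    for (size_t j = 0; j < l; ++j)                   // top part: R̂ z
        for (size_t k = j; k < l; ++k) rnew[j] += R[k][j] * z[k];
    Vec res = a;                                     // diagonal entry: ||a_i - A^P z||
    for (size_t j = 0; j < l; ++j)
        for (size_t t = 0; t < m; ++t) res[t] -= AP[j][t] * z[j];
    double nrm = 0.0;
    for (double v : res) nrm += v * v;
    rnew[l] = std::sqrt(nrm);
    return rnew;
}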
3.4. QR Downdating by Rotations. The reordering strategy is less applicable
to the downdating scheme, as a deleted column may not be at the right-most index.
Suppose that column a_i is removed from matrix A^P = [a_1, · · · , a_{i−1}, a_{i+1}, · · · , a_n]. Let
p̂_j be the corresponding column index in the ordered list. We consider the reformulation
of (3.1) without column p̂_j as

(3.3)  Ã^P = [a_{p̂_1}, · · · , a_{p̂_{j−1}}, a_{p̂_{j+1}}, · · · , a_{p̂_i}],
       Q̃ = [q_{p̂_1}, · · · , q_{p̂_{j−1}}, q_{p̂_j}, q_{p̂_{j+1}}, · · · , q_{p̂_i}],
       R̃ = [r_{p̂_1}, · · · , r_{p̂_{j−1}}, r_{p̂_{j+1}}, · · · , r_{p̂_i}],

where Ã^P = Q̃R̃, matrices Ã^P and R̃ are missing column p̂_j, and matrix Q̃ = Q̂ is
unchanged. Column q_{p̂_j} still exists in matrix Q̃, and matrix R̃ is no longer upper-
triangular as the columns to the right of index p̂_j have shifted left.
Observe that the shifted right sub-matrix of R̃ has an upper-Hessenberg form. In [11],
a series of Givens rotations introduces zeros along the sub-diagonal. However, this
does not directly address the removal of column p̂_j in matrix Q̃. Instead, we apply
a series of Givens rotations to introduce zeros along the j-th row of matrix R̃. The
rotations are applied to the right of column p̂_j in the transformation

Ã^P = (Q̃ G_j^T G_{j+1}^T · · · G_{i−1}^T G_i^T)(G_i G_{i−1} · · · G_{j+2} G_{j+1} R̃),

(3.4)  (G_i G_{i−1} · · · G_{j+2} G_{j+1} R̃) =
       [  ⋱                          ]
       [      ∗   ∗   ∗   ∗          ]
       [      0   ∗   ∗   ∗          ]
       [ ···  0   0   0   0  ···     ]   (row j, reduced to zeros)
       [      0   0   ∗   ∗          ]
       [      0   0   0   ∗          ]
       [                         ⋱   ]

which preserves Ã^P = Q̃R̃ while introducing zeros along the j-th row of matrix R̃ and
modifying matrix Q̃. This enables both row j of the updated matrix R̃ and column
j of the updated matrix Q̃ to be removed without violating the factorization of matrix
Ã^P. Vector Q̃^T b is updated via a similar transformation.

Algorithm 3 Reordered QR Column Downdating with Givens Rotations


Require: Reordered list P̂ contains the elements in set P , index i is the variable
removed from set P
Ensure: A˜P = Q̃R̃, update vector Q̃T b, list P̂
1: for all column indices k following (≻) index i ∈ list P̂ do
2: Let r = √(R̃kk^2 + R̃ik^2), c = R̃kk / r, s = R̃ik / r
3: [ R̃k,j≽k∈P̂ ; R̃i,j≽k∈P̂ ] = [ c  s ; −s  c ] [ R̃k,j≽k∈P̂ ; R̃i,j≽k∈P̂ ]
4: [ Q̃:,k  Q̃:,i ] = [ Q̃:,k  Q̃:,i ] [ c  −s ; s  c ]
6: Let coefficients bh = Q̃T bk and bl = Q̃T bi
7: Set Q̃T bk = c · bh + s · bl and Q̃T bi = −s · bh + c · bl
8: end for
9: Remove index i from list P̂

We refer to [11] for precautions when computing the rotation coefficients c, r in
algorithm 3. When updating matrices R̃ and Q̃, a row or column is fixed, so the
transformation requires 2(ℓ − i + 1) row and column memory reads and writes each.
Updating matrices Q̃ and R̃ requires 6m(m − i) and 6(ℓ − i) + 6(ℓ − i + 1)(ℓ/2 − i/2 − 1)
flops respectively. The asymptotic complexity is O(m^2 + n^2). The costs for the
downdates are summarized in Table 3.1.
Algorithm        | Col/row accesses             | Up/down Q flops            | Up/down R flops
MGS/up/reorder   | ℓ + 2                        | 6mℓ + 3m + 1               | included in Q
MGS/up/unorder   | ℓ(ℓ − i + 2) + 2(ℓ − i + 2)  | (3mℓ + 3mi + 1)(ℓ − i + 2) | included in Q
Rot/up/reorder   | 2(m − ℓ − 1)                 | 6m(m − ℓ − 1)              | 2m^2 + 8(m − ℓ − 1)
Rot/up/unorder   | 4(m − ℓ − 1)                 | 6m(m − i)                  | 2m^2 + 8(m − i) + 6(ℓ − i + 1)(ℓ/2 − i/2 − 1)
CSNE/up/reorder  | 3m + 2ℓ + 1                  | NA                         | 6m^2 + 4m + 1 + 3ℓ^2 + 3ℓ
CSNE/up/unorder  | 3m + 4ℓ − 2i + 1             | NA                         | 6m^2 + 4m + 1 + 3ℓ^2 + 6(ℓ − i + 1)(ℓ/2 − i/2 − 1) + 6ℓ − 3i
Rot/down/NA      | 4(ℓ − i + 1)                 | 6m(m − i)                  | 6(ℓ − i) + 6(ℓ − i + 1)(ℓ/2 − i/2 − 1)

Table 3.1
Costs for QR updating/downdating methods with respect to the reordering strategy. The rotation
and CSNE methods have flops of order m^2. For overdetermined and square systems where ℓ ≤ n ≤
m, this quantity is minimized for the modified Gram-Schmidt method.
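A dense host-side sketch of the downdate in algorithm 3 is shown below (CUDA C++; row-major storage for R̃ and Q̃ here, illustrative names). Column j of R̃ is simply ignored, mirroring the in-place storage described in section 5.3; see [11] for safeguards when the rotation radius r is tiny.

#include <cmath>
#include <vector>

// Zero out row j of R so that row j of R and column j of Q can be dropped.
// R is ell x ell (R[row][col], upper triangular), Q is m x ell (Q[row][col]),
// qtb holds Q^T b; j is the index of the removed column.
void givens_downdate(std::vector<std::vector<double>>& R,
                     std::vector<std::vector<double>>& Q,
                     std::vector<double>& qtb, size_t j) {
    const size_t ell = R.size(), m = Q.size();
    for (size_t k = j + 1; k < ell; ++k) {
        const double r = std::hypot(R[k][k], R[j][k]);  // rotation maps (R_kk, R_jk) -> (r, 0)
        const double c = R[k][k] / r, s = R[j][k] / r;
        for (size_t col = k; col < ell; ++col) {        // rotate rows k and j of R
            const double hi = R[k][col], lo = R[j][col];
            R[k][col] = c * hi + s * lo;
            R[j][col] = -s * hi + c * lo;               // R[j][k] becomes 0
        }
        for (size_t t = 0; t < m; ++t) {                // rotate columns k and j of Q
            const double qk = Q[t][k], qj = Q[t][j];
            Q[t][k] = c * qk + s * qj;
            Q[t][j] = -s * qk + c * qj;
        }
        const double bh = qtb[k], bl = qtb[j];          // same rotation on Q^T b
        qtb[k] = c * bh + s * bl;
        qtb[j] = -s * bh + c * bl;
    }
    // Row j of R, column j of Q, and entry j of qtb may now be discarded
    // (the paper's implementation keeps storage in place and only reorders P̂).
}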

4. Multi-core CPU Architectures. The multi-core trend began as a response
to the slowdown of Moore's Law as manufacturers approached the limits of single-core
clock speeds. With additional cores added on chip, individual CPU threads can be
assigned to and processed by their own units in hardware. Thus, a single problem
is decomposed and solved by several threads without over-utilizing a single core.
This gave multi-threading an edge over traditional single-core processors, as data and
instruction-level caches could be dedicated to a smaller sub-set of operations.
Such Multiple-Instruction-Multiple-Data (MIMD) architectures support task-level
parallelism, where each core can asynchronously execute separate threads on separate
data regions. The individual cores are often super-scalar and thus capable of processing
out-of-order instructions in their pipelines. This allows multi-core architectures
to simulate data-level parallelism from Single-Instruction-Multiple-Data (SIMD)-like
architectures such as the GPU with added proficiency. Furthermore, multi-core
architectures have access to a common pool of main memory off-die and are capable of
multi-level caching per core and per processor on-die. For both data and task-level
parallelism, this allows memory to be decomposed and cached on a per-core basis for
efficient reuse.
Several application programming interfaces (APIs) and libraries take advantage of
these shared-memory multiprocessing environments for high performance computing.
OpenMP (Open Multi-Processing) is an API based on fork-join operations, where the
program enters a designated parallel region [22]. Each thread exhibits both task
and data-level parallelism as it independently executes code within the same parallel
region. The Intel Math Kernel Library (MKL) is a set of optimized math routines with
calls to the Basic Linear Algebra Subprograms (BLAS) and Linear Algebra PACKage
(LAPACK) libraries [14]. Many of its fundamental matrix and vector routines are
blocked and solved across multiple threads.
4.1. CPU Implementation. To exploit the advantages of multi-threading, we
adopt both the OpenMP API and the Intel MKL in the CPU implementation. One
way to map each linear system to a thread is to declare the entire NNLS algorithm
within a parallel OpenMP region. That is, a specified fraction of the threads execute
NNLS on mutually exclusive sets of linear systems of equations. The remaining
threads are dedicated to the MKL library in order to accelerate the common matrix-
vector and vector-vector operations used to solve the unconstrained least squares
sub-problem.
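A minimal sketch of this mapping is shown below (CUDA C++ host code with OpenMP; nnls_solve and the System struct are hypothetical placeholders, and the OpenMP/MKL thread split mentioned above is left as a run-time configuration choice):

#include <omp.h>
#include <vector>

struct System { std::vector<double> A, b, x; int m, n; };   // one linear system

void nnls_solve(System& sys);   // per-system active-set solver, provided elsewhere

// Each OpenMP thread takes whole systems; dynamic scheduling balances the
// varying active-set iteration counts across threads.
void solve_all(std::vector<System>& systems) {
    #pragma omp parallel for schedule(dynamic)
    for (int s = 0; s < (int)systems.size(); ++s)
        nnls_solve(systems[s]);
}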
5. GPU Architectures. Recent advances in general purpose graphics process-
ing units (GPUs) have given rise to highly programmable architectures designed with
parallel applications in mind. Moreover, GPUs are considered to be typical of future
generations of highly parallel, multi-threaded, multi-core processors with tremendous
computational horsepower. They are well-suited for algorithms that map to a Single-
Instruction-Multiple-Thread (SIMT) architecture. Hence, GPUs achieve a high arith-
metic intensity (ratio of arithmetic operations to memory operations) when performing
the same operations across multiple threads on a multi-processor.
GPUs are often designed as a set of multiprocessors, each containing a smaller set
of scalar-processors (SP) with a Single-Instruction-Multiple-Data (SIMD) architec-
ture. Hardware multi-threading under a SIMT architecture maps multiple threads to
a single SP. A single SP handles the instruction address and register states of multi-
ple threads so that they may execute independently. The multiprocessor’s SIMT unit
schedules batches of threads to execute a common instruction. If threads of the same
batch diverge via a data-dependent conditional branch, then all the threads along the
separate branches are serialized until they converge back to the same execution path.

Fig. 5.1. GPU multiprocessor and memory model

GPUs have a hierarchical memory model with significantly different access times
at each level. At the top, all multiprocessors may access a global memory pool on the
device. This is the common space where input data is generally copied and stored
from main memory by the host. It is also the slowest memory to access, as a single
query from a multi-processor has a 400 to 600 clock-cycle latency on a cache-miss.
See [20] for a discussion of coalesced global memory accesses, which read or write
a contiguous chunk of memory at the cost of one query, and of implicit caching on the
Fermi architecture. On the same level, texture memory is also located on the device
but can only be written to from the host. However, it is faster than global memory
when access patterns are spatially local. On the next level, SPs on the same multi-
processor have access to a fast shared memory space. This enables explicit inter-
thread communication and temporary storage for frequently accessed data. Constant
memory, located on each multi-processor, is cached and optimized for broadcasting
to multiple threads. On the lowest level, an SP has its own private set of registers
distributed amongst its assigned threads. The latency for accessing both shared memory
and per-processor registers normally adds zero extra clock cycles to the instruction time.
Programming models such as NVIDIA’s Compute Unified Device Architecture
(CUDA) [20] and OpenCL [21] organize threads into thread-blocks, which in turn are
arranged in a 2D grid. A thread-block refers to a 1-D or 2-D patch of threads that
are executed on a single multiprocessor. These threads efficiently synchronize their
instructions and pass data via shared memory. Instructions are generally executed in
parallel until a conditional branch or an explicit synchronization barrier is declared.
The synchronization barrier ensures that the thread-block waits for all its threads to
complete their last instruction. Thus, two levels of data parallelism are achieved. The
threads belonging to the same thread-block execute in lock-step as they process a set
of data. Individual thread-blocks execute asynchronously but generally with the same
set of instructions on a different set of data.
While efficient algorithms on sequential processors must reduce the number of
computations and cache-misses, parallel algorithms on GPUs are more concerned
with minimizing data dependencies and optimizing accesses to the memory hierarchy.
Data dependency increases the number of barrier synchronizations amongst threads
and is often subject to the choice of the algorithm. Memory access patterns present
a difficult bottleneck on multiple levels. While latency is the first concern for smaller
problems, we run into a larger issue with memory availability as the problem size
grows. That is, the shared memory and register availability are hard limits that bound
the size and efficiency of thread-blocks. A register memory bound per SP limits the
number of threads assigned to each SP and so decreases the maximum number of
threads and thread-blocks running per multi-processor. A shared memory bound per
multi-processor limits the number of thread-blocks assigned to a multi-processor and
so decreases the total number of threads processed per multi-processor.

5.1. GPU Implementation. One way to map each linear system onto a GPU
is to consider every thread-block as an independent vector processor. Each thread-
block of size m × 1 maps to the elements in a column vector and asynchronously
solves for a mutually exclusive set of linear systems. The number of thread-blocks
that fit onto a single multi-processor depends on the column size m, or the number of
equations in the linear system. This poses a restriction on the size of linear systems
that our GPU implementation can solve, as the maximum size m is constrained to a
fraction of the amount of shared memory available per multi-processor. Fortunately,
this is not an issue for applications where m is small (500-1000) and the number of
linear systems to be solved is large. However, for arbitrarily sized linear systems of
equations, our GPU implementation is not generalizable. We note that this is not
an algorithmic constraint but rather a design choice for our application. Our multi-
core CPU implementation of the same algorithm can solve for arbitrarily sized linear
systems. We discuss the details of the GPU implementation in sections 5.2-5.3.
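A skeleton of this mapping is sketched below (names, data packing, and launch parameters are illustrative assumptions; the body of the active-set iteration is elided):

// One thread-block per linear system; the block's m threads act as the vector
// processor for that system. Systems are assumed packed back to back.
__global__ void nnls_block_kernel(const float* A,   // numSystems blocks of m*n
                                  const float* b,   // numSystems blocks of m
                                  float* x,         // numSystems blocks of n
                                  int m, int n) {
    extern __shared__ float smem[];                 // per-block workspace (list P̂, reductions)
    const int sys = blockIdx.x;                     // which system this block owns
    const int t   = threadIdx.x;                    // which row this thread owns (t < m)
    const float* Asys = A + (size_t)sys * m * n;
    const float* bsys = b + (size_t)sys * m;
    float*       xsys = x + (size_t)sys * n;
    // ... active-set iterations with MGS updates and Givens downdates,
    //     using smem for the reductions of section 5.2 (elided) ...
    (void)Asys; (void)bsys; (void)xsys; (void)t; (void)smem;
}

// Launch: one block per system, m threads per block, shared memory sized to the
// per-block workspace (bounded by the shared memory available per multi-processor):
//   nnls_block_kernel<<<numSystems, m, smemBytes>>>(dA, db, dx, m, n);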
5.2. Parallelizing QR Methods. Full QR decompositions on the GPU via
blocked MGS, Givens rotations, and Householder reflections are implemented in [23],
[15]. While [15] notes that the blocked MGS and Givens rotation methods are ill-
suited for large systems on GPUs, as they suffer from instability and synchronization
overhead, we are interested in only the QR updating and downdating schemes for a
large number of small systems. We show that it is possible for multi-processors running
m-thread blocks to efficiently perform the MGS updating and Givens rotation downdating
steps.
For the MGS update step, most of the operations are formulated as vector inner
products, scalar-vector products, and vector-vector summations. These operations
lead to a one-to-one mapping between the m × 1 column vector coordinates and
the m-thread thread-block. Such operations are computable via the parallel reduction
techniques from [4]. In algorithm 2, we parallelize all four inner products (lines 3, 4,
6, 8) in log m parallel time each. The inner loop iterates ℓ or at most n times.
Thus, we obtain an order reduction in parallel time-complexity to O(n log m).
For the Givens rotation downdate step, we obtain a one-to-many mapping between
the n × 1 row vector elements of matrix R and the m threads in a thread-block. We
obtain a one-to-one mapping for the m × 1 column vector elements of matrix Q.
Computing vector Q^T b follows a similar relation. To obtain the rotation coefficients
c, s, a single thread computes them and broadcasts the result to the rest of the thread-block.
In algorithm 3, the inner loop (lines 3-4) updates both matrices R̃ and Q̃ in parallel
O(1) time. Writing the row and column data and updating vector Q̃^T b (line 7) are
thread-independent and computable in O(1) parallel time. Thus, we obtain an order
reduction in parallel time-complexity to O(n).
Parallel reductions are often performed on the GPU in place of common vector-
vector operations, using the prefix-sum techniques discussed in [13]. Algorithm 4 sums 512
elements in 9 parallel flops, 5 thread-synchronizations, and 18 parallel shared memory accesses.
Each of the 512 threads reserves a memory slot in shared memory. The unique thread
ID, or tID, denotes the corresponding data index in the shared memory array. At
each step, half the threads from the previous step sum the data entries stored in
the other half of shared memory. The process continues until index 0 of the shared
memory array contains the total summation.
5.3. Memory Usage. To take advantage of the different access times in the GPU
memory hierarchy, the input and intermediate data can be stored and accessed on
different levels for efficient reuse. Local intermediate vectors can be stored either in
shared memory or, alternatively, in dedicated registers spanning all threads in a thread-
block. List P̂ is stored in shared memory, as multiple threads require synchronization
to update and downdate the same column. The right-hand-side vector Q̂^T b is stored
in registers, since no thread accesses elements outside its one-to-one mapping in the
update and downdate steps.
Global memory accesses on the GPU are unavoidable for updating the large matrices
Q̂ and R̂. We store matrix Q̂^T so that column vector accesses are coalesced in row-
oriented programming models, and matrix R̂ as the Givens rotations update the rows.
Matrices Q̂ and R̂ are stored in-place, unlike the compact format in (3.1), (3.3). We
allocate m × n blocks of global memory and use the reordered list P̂ to associate
column and row indices for the update and downdate steps. This avoids any physical
shift of column vectors in global memory. Rather, we shift the list P̂ in parallel when a
variable is removed from the passive-set.

Algorithm 4 CUDA parallel floating-point summation routine [13]


__device__ float reduce512(float smem512[], unsigned short tID) {
    __syncthreads();                       // all 512 partial values must be written first
    if (tID < 256) smem512[tID] += smem512[tID + 256];
    __syncthreads();
    if (tID < 128) smem512[tID] += smem512[tID + 128];
    __syncthreads();
    if (tID < 64) smem512[tID] += smem512[tID + 64];
    __syncthreads();
    if (tID < 32) {                        // remaining steps within a single warp
        smem512[tID] += smem512[tID + 32];
        smem512[tID] += smem512[tID + 16];
        smem512[tID] += smem512[tID + 8];
        smem512[tID] += smem512[tID + 4];
        smem512[tID] += smem512[tID + 2];
        smem512[tID] += smem512[tID + 1];
    }
    __syncthreads();
    return smem512[0];                     // index 0 holds the total sum
}
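As a usage sketch, a length-m inner product (m ≤ 512) can be computed with the routine above as follows; all 512 threads of the block must participate, and out-of-range threads contribute zero (names are illustrative):

__device__ float dot512(const float* u, const float* v, int m,
                        float smem512[], unsigned short tID) {
    smem512[tID] = (tID < m) ? u[tID] * v[tID] : 0.0f;   // one product per thread
    return reduce512(smem512, tID);                       // tree summation from Algorithm 4
}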

The MGS update step reads ℓ columns of matrix Q̂ from global memory into registers.
Computing the inner products and vector norms during the projections requires an
intermediate shared memory vector for the parallel reduction function. The new column
of matrix R̂ is stored locally in registers before being written to global memory. A single
element of vector Q̂^T b is updated and written to shared memory. The total number of
parallel shared memory accesses is 39ℓ + 2. The total number of parallel global memory
accesses is ℓ + 2.
The Givens rotation downdate step accesses two columns of matrix Q̃ and two
rows of matrix R̃ for each of the ℓ − i transformations. Since row i of matrix R̃ and
column i of matrix Q̃ are fixed across the transformations, they are stored and updated
in shared memory. The other row and column are updated directly in global memory.
Updating vector Q̃^T b requires two shared memory reads and writes. The total number
of parallel shared memory accesses is 2(ℓ − i) + 2. The total number of parallel global
memory accesses is 2(ℓ − i + 1).

6. Application. In remote sensing, a discrete-time deconvolution recovers a signal
x that has been convolved with a transfer function s. The known signal s is often
convolved with an unknown signal x that satisfies certain characteristics:

(6.1)  y(t) = s(t) ∗ x(t) = ∫_{−∞}^{∞} s(τ)x(t − τ) dτ = ∫_{−∞}^{∞} x(τ)s(t − τ) dτ
            = Σ_{τ=−∞}^{∞} x(τ)s(t − τ) = Σ_{τ=1}^{n} x(τ)s(t − τ),

where t is the sample time, y(t) the observed signal, and n the number of samples
over time. To solve for the unknown signal x, we rewrite (6.1) as the following linear
system of equations Ax = b, where A is a Toeplitz matrix:

(6.2)  A = [ s(0)      s(−1)     · · ·  s(−(n − 1))
             s(1)      s(0)      · · ·  s(−(n − 2))
             ...       ...       ...    ...
             s(n − 1)  s(n − 2)  · · ·  s(0)        ],

       x = [ x(1) · · · x(n) ]^T,
       b = [ y(1) · · · y(n) ]^T,
       Ax = b.
Efficient algorithms for the deconvolution problem, which either exploit the simple
structure of the convolution in Fourier space, or which exploit the Toeplitz structure
of the matrix, are available in [5], [8]. However, if signal x is known to be non-negative
and the data y(t) is corrupted by noise, then we may treat the deconvolution as a
NNLS problem to ensure non-negativity.
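A small sketch of how (6.2) might be assembled is shown below (CUDA C++ host code, column-major storage). It assumes a causal impulse, i.e., s(k) = 0 for k < 0; the text above does not state this, so it is an assumption of the sketch.

#include <vector>

// Build the n x n Toeplitz matrix with A(r, c) = s(r - c); entries with
// r - c < 0 or beyond the sampled support of s are set to zero (assumption).
std::vector<double> build_toeplitz(const std::vector<double>& s, int n) {
    std::vector<double> A((size_t)n * n, 0.0);
    for (int c = 0; c < n; ++c)
        for (int r = 0; r < n; ++r) {
            const int k = r - c;
            if (k >= 0 && k < (int)s.size()) A[(size_t)c * n + r] = s[k];
        }
    return A;
}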
7. Experiments. As a baseline, we note that Matlab’s lsqnonneg function im-
plements the same active-set algorithm but with a full QR decomposition for the
least squares sub-problem. Matlab 2009b and later versions use Intel’s Math Kernel
Library (MKL) with multi-threading to resolve the least squares sub-problems. For
a better comparison, we first port the lsqnonneg function into native C-code with
calls to multi-threaded MKL BLAS and LAPACK functions. The results from this
implementation (CPU lsqnonneg) show a 1.5-3x speed-up over the Matlab lsqnonneg
function in our experiments. Next, we apply our updating and downdating strategies
with column reordering using MKL BLAS functions to the CPU version. The results
from this second implementation (CPU NNLS) show a 1-3x speed-up that depends
on the number of column updates and downdates. Last, we compare the lsqnonneg
variants to alternative NNLS algorithms from the literature.
To compare the GPU implementation with the multi-threaded CPU variants, we
time from the point of entry to the point of exit of the GPU kernel function. Memory
transfer and pre-processing times in the case of non-synthetic data are omitted. Both
the GPU and CPU variants also obtain identical solutions, subject to rounding error,
within the same number of iterations for all data sets. We find that for a small number
of linear systems, the CPU implementations outperform the GPU, as only a fraction
of the GPU cores are utilized. When the number of linear systems surpasses the number
of multi-processors, the GPU scales better, by a factor of 1-3x, than our fastest CPU
implementation.
For reference, we use a dual quad-core Intel(R) Xeon(R) X5560 CPU @ 2.80 GHz
(8 cores) for testing our CPU implementations. The CPU codes, compiled under both
Intel icc 11.1 and gcc 4.5.1 and linked to MKL 10.1.2.024, yield similar results for 8
run-time threads. The codes tested under both Matlab 2010b and 2009b also yield
comparable results. Mixing the number of threads assigned between OpenMP and
MKL did not have a large impact on our system. We use an NVIDIA Tesla C2050 (448
cores across 14 multi-processors) and codes compiled under CUDA 3.2 for the GPU
implementation and testing.
7.1. Non-Synthetic Test Data. For real-world data, we use terrain laser imaging
sets obtained from the NASA Laser Vegetation Imaging Sensor (LVIS)1. Each
data set contains multiple 1-D Gaussian-like signals s and observations of total return
energy b of size m = n = 432. In this deconvolution problem, the transfer functions
s represent the single impulse energy fired over time at ground terrain, and the observed
signals b are waveforms that indicate the reflected energy over time. The
signals s are generally 15-25 samples wide, so the computed matrices A are Toeplitz,
banded, and sparse. NNLS solves corresponding pairs of matrix A and vector b
to obtain the sparse non-negative solutions x that represent the times of arrival for a
series of fired impulses. This estimates the ranges or distances to a surface target.
For comparing the NNLS methods, we record the run-times in relation to the
number of column updates and downdates for the least squares sub-problem. The
results from CPU NNLS show an 11x speed-up over the GPU implementation when
solving a single system. This is due to the underutilization of cores in all but
the multi-processor currently assigned to the linear system of interest. For a larger
number of systems, the GPU results show a 1-2x speed-up over CPU NNLS, due in part
to the larger number of processing units suited to the vector operations in the algorithm.
The results between CPU NNLS and CPU lsqnonneg show the performance gains
from fewer flops and memory accesses attained by the column reordering, updating,
and downdating strategies.

Number of systems           | 1      | 24       | 48       | 96       | 192
Number of updates           | 108    | 1477     | 2806     | 5695     | 12067
Number of downdates         | 14     | 220      | 406      | 834      | 1839
GPU NNLS                    | 0.2172 | 0.2238   | 0.2356   | 0.4508   | 0.7203
CPU NNLS                    | 0.0186 | 0.1636   | 0.2990   | 0.5989   | 1.2489
CPU lsqnonneg               | 0.0770 | 0.7163   | 1.2908   | 2.6225   | 5.5460
Matlab lsqnonneg            | 0.1342 | 1.1236   | 2.0152   | 4.1268   | 8.7176
Matlab FNNLS [6]            | 0.0493 | 0.5936   | 1.1135   | 2.2639   | 4.6148
Matlab interior-points [16] | 8.3642 | 135.7896 | 261.3665 | 528.4468 | 1082.8714
Matlab PQN-NNLS [17]        | 1.4161 | 138.2953 | 217.6929 | 479.9867 | 862.9674

Table 7.1
Runtime (seconds) comparisons of NNLS and lsqnonneg variants for signal deconvolution. Signal
and observation data taken from LVIS Sierra Nevada, USA (California, New Mexico), 2008.

7.2. Synthetic Test Data. For the first set of synthetic data, we generate
mean-shifted 1-D Gaussians with σ = 4.32 to store as columns in a matrix A of size
m = n = 512. In this Gaussian fitting problem, each system uses the same matrix
A but with non-negative random vectors b. The choice of the σ parameter ensures
that the mean-shifted Gaussians are not so wide as to allow early convergence and
not so narrow as to locally affect only a few variables. Furthermore, matrix A is
now considered dense, and the vectors b no longer reflect real-world values. We expect
the average number of iterations, or column updates and downdates, to exceed that of
the real-world data cases.
The total speed-up of GPU NNLS over the CPU variants is more pronounced
(3x compared to CPU NNLS, 23x compared to CPU lsqnonneg). The larger ratio
of column downdate to update steps suggests that our reordering strategy and fast
Givens rotation method in the downdating step outperform the lsqnonneg variants.
1 https://lvis.gsfc.nasa.gov/index.php

Number of systems           | 1      | 24       | 48       | 96       | 192
Number of updates           | 165    | 3907     | 7792     | 15541    | 30966
Number of downdates         | 92     | 2109     | 4196     | 8362     | 16599
GPU NNLS                    | 0.4257 | 0.5094   | 0.9546   | 1.7862   | 3.2672
CPU NNLS                    | 0.0654 | 1.2238   | 2.4141   | 4.8067   | 9.6250
CPU lsqnonneg               | 0.4791 | 8.7611   | 17.2483  | 34.5396  | 69.4889
Matlab lsqnonneg            | 0.9437 | 19.5904  | 38.6469  | 77.1072  | 155.7700
Matlab FNNLS [6]            | 0.4937 | 11.7176  | 23.9502  | 47.4635  | 91.4500
Matlab interior-points [16] | 1.6317 | 40.7017  | 83.9788  | 164.4839 | 328.9569
Matlab PQN-NNLS [17]        | 3.1106 | 128.6616 | 253.3796 | 504.3153 | 989.2051

Table 7.2
Runtime (seconds) comparisons of NNLS and lsqnonneg variants on multiple systems of mean-
shifted Gaussian matrix A and random vectors b.

For the second set of synthetic data, we generate both random matrices A of size
m = n = 512 and non-negative random vectors b. The number of column updates
and downdates is less than that of the two previous experiments. Furthermore, the
total number of column updates dominates the number of column downdates. The
results for GPU and CPU NNLS show that both implementations have similar
run-time scaling as the number of systems increases. This suggests that most of
the performance gains in the prior experiments come from the GPU downdating steps. The
results between CPU NNLS and CPU lsqnonneg lead to a similar conclusion, as the
performance gain (1.7x) is minimal compared to the prior two experiments.

Number of systems           | 1      | 24      | 48       | 96       | 192
Number of updates           | 52     | 1169    | 2361     | 4766     | 9525
Number of downdates         | 0      | 13      | 28       | 58       | 124
GPU NNLS                    | 0.1194 | 0.1397  | 0.2526   | 0.4875   | 0.8667
CPU NNLS                    | 0.0068 | 0.1157  | 0.2245   | 0.4352   | 0.8794
CPU lsqnonneg               | 0.0223 | 0.4665  | 0.8959   | 1.7269   | 3.5243
Matlab lsqnonneg            | 0.0330 | 0.8096  | 1.5583   | 3.0097   | 6.1627
Matlab FNNLS [6]            | 0.0248 | 0.5902  | 1.1602   | 2.2920   | 4.6022
Matlab interior-points [16] | 0.4480 | 10.7729 | 21.1767  | 41.9441  | 83.6695
Matlab PQN-NNLS [17]        | 1.5935 | 55.8196 | 110.0892 | 221.7277 | 441.7190

Table 7.3
Runtime (seconds) comparisons of NNLS and lsqnonneg variants on multiple systems of random
matrices A and random vectors b.

8. Concluding Remarks. In this paper, we have presented an efficient procedure
for solving the least squares sub-problems in the active-set algorithm. We have
shown that prior QR decompositions may be used to update and solve similar least
squares sub-problems. Furthermore, a reordering of variables in the passive-set yields
fewer computations in the update step. This has led to substantial speed-ups over
existing methods in both the GPU and CPU implementations. Applications to
satellite-based terrain mapping and to microphone array signal processing are being
worked on. Both GPU and CPU source codes are available on-line at
http://www.cs.umd.edu/~yluo1/Projects/NNLS.html.
9. Acknowledgements. We would like to acknowledge James Blair and Michelle
Hofton at NASA for providing us with data for the deconvolution problem, NVIDIA
and NSF award 0403313 for facilities, and ONR award N00014-08-10638 for support.

REFERENCES

[1] S. Bellavia, M. Macconi, and B. Morini, An interior point Newton-like method for non-
negative least squares problems with degenerate solution, Numerical Linear Algebra with
Applications, 13 (2006), pp. 825–846.
[2] M. H. van Benthem and M. R. Keenan, Fast algorithm for the solution of large-scale non-
negativity-constrained least squares problems, Journal of Chemometrics, Vol. 18, (2004),
pp. 441–450.
[3] A. Björck, Stability Analysis of the method of semi-normal equations for least squares prob-
lems, Linear Algebra Appl., 88/89 (1987), pp. 31–48.
[4] G. E. Blelloch, Prefix Sums and Their Applications, Technical Report CMU-CS-90-190,
Carnegie Mellon University, Pittsburgh, PA, Nov. (1990).
[5] A. W. Bojanczyk, R. P. Brent and F. R. de Hoog, QR factorization of toeplitz matrices,
Numerische Mathematik, 49 (1986), pp. 81–94.
[6] R. Bro and S. De Jong, A fast non-negativity-constrained least squares algorithm, Journal of
Chemometrics, Vol. 11, No. 5, (1997), pp. 393–401.
[7] M. Catral, L. Han, M. Neumann, and R. Plemmons, On reduced rank nonnegative matrix
factorization for symmetric nonnegative matrices, Lin. Alg. Appl., 393 (2004), pp. 107–126.
[8] R. H. Chan, J. G. Nagy, and R. J. Plemmons, FFT-based preconditioners for toeplitz-block
least squares problems. SIAM J. Numer. Anal. 30 (1993), pp. 1740–1768.
[9] D. Chen, R. J. Plemmons, Nonnegativity Constraints in Numerical Analysis, Symp on the
Birth of Numerical Analysis, Leuven, Belgium, (2007).
[10] V. Franc, V. Hlaváč, and M. Navara, Sequential coordinate-wise algorithm for non-negative
least squares problem, Research report CTU-CMP-2005-06, Center for Machine Perception,
Czech Technical University, Prague, Czech Republic, (2005).
[11] G. H. Golub and C. F. Van Loan, Matrix Computations, Third ed., The Johns Hopkins
University Press, Baltimore, MD, 1996, pp. 223–236.
[12] S. Hammarling, C. Lucas, Updating the QR factorization and the least squares problem,
MIMS EPrint, Manchester Institute for Mathematical Sciences, University of Manchester,
Manchester, (2008).
[13] M. Harris, Optimizing parallel reduction in CUDA, (2007).
[14] Intel, Math kernel library reference manual, (2010).
[15] A. Kerr, D. Campbell, and M. Richards, QR decomposition on GPUs, GPGPU-2: Pro-
ceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units.
New York, NY, USA: ACM, (2009), pp. 71–78.
[16] S. J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, An interior-point method for
large-Scale l1-regularized least squares, IEEE Journal of Selected Topics in Signal Process-
ing, Vol. 1, No. 4, (2007), pp. 606–617.
[17] D. Kim, S. Sra, and I. S. Dhillon, A new projected quasi-Newton Approach for the non-
negative least squares problem, Technical Report TR-06-54, Computer Sciences, The Uni-
versity of Texas at Austin, (2006).
[18] H. W. Kuhn and A. W. Tucker, Nonlinear programming, Proceedings of the Second Berkeley
Symposium on Mathematical Statistics and Probability, University of California Press,
Berkeley, CA, (1951), pp. 481–492.
[19] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, Prentice-Hall, 1987.
[20] NVIDIA, CUDA programming guide 3.2, (2011).
[21] NVIDIA, OpenCL programming guide for CUDA architectures 3.1, (2010).
[22] Sun Microsystems, inc, OpenMP API user guide, (2003).
[23] V. Volkov and J. W. Demmel, Benchmarking GPUs to tune dense linear algebra, SC08,
(2008).
