Efficient Parallel Non-Negative Least Squares On Multi-Core Architectures
YUANCHENG LUO AND RAMANI DURAISWAMI
UNIVERSITY OF MARYLAND, COLLEGE PARK
Abstract. We parallelize a version of the active-set iterative algorithm derived from the original
works of Lawson and Hanson (1974) on multi-core architectures. This algorithm requires the solution
of an unconstrained least squares problem in every step of the iteration for a matrix composed
of the passive columns of the original system matrix. To achieve improved performance, we use
parallelizable procedures to efficiently update and downdate the QR factorization of the matrix
at each iteration, to account for inserted and removed columns. We use a reordering strategy of
the columns in the decomposition to reduce computation and memory access costs. We consider
graphics processing units (GPUs) as a new platform for efficient parallel computation and compare our
GPU implementation to implementations on multi-core CPUs. Both synthetic and non-synthetic data are used in the
experiments.
Key words. Non-negative least squares, active-set, QR updating, parallelism, multi-core, GPU,
deconvolution
In order to apply the KKT conditions to the minimization function (1.1), let $\nabla f(x) =
A^T(Ax - b)$, $g_j(x) = -x_j$, and $h_k(x) = 0$. This leads to the necessary conditions,
and reduce zigzagging. Another approach in [10] produces a sequence of vectors, each op-
timized over a single coordinate with all other coordinates fixed. Each such update has an
efficiently computable analytical solution, and the sequence converges to the NNLS solution.
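To make the coordinate-wise update concrete, the following is a minimal sketch of one sweep of the sequential coordinate-wise algorithm, assuming dense row-major storage; the routine and variable names are ours, not those of [10]. For coordinate j, the minimizer of ||Ax - b||^2 over x_j >= 0 with the other coordinates fixed is x_j = max(0, x_j - g_j / h_jj), where g = A^T(Ax - b) and h_jj = (A^T A)_jj.

// One sweep of the sequential coordinate-wise NNLS update (illustrative sketch).
#include <vector>
#include <algorithm>

void coordinateSweep(const std::vector<double>& A,   // m x n, row-major
                     const std::vector<double>& b,   // length m
                     std::vector<double>& x,         // length n, non-negative iterate
                     int m, int n) {
    std::vector<double> r(m);                        // residual r = A x - b
    for (int i = 0; i < m; ++i) {
        double s = -b[i];
        for (int j = 0; j < n; ++j) s += A[i * n + j] * x[j];
        r[i] = s;
    }
    for (int j = 0; j < n; ++j) {
        double g = 0.0, h = 0.0;                     // g_j and (A^T A)_jj
        for (int i = 0; i < m; ++i) {
            g += A[i * n + j] * r[i];
            h += A[i * n + j] * A[i * n + j];
        }
        if (h == 0.0) continue;
        double xnew = std::max(0.0, x[j] - g / h);   // analytic coordinate minimizer
        double d = xnew - x[j];
        if (d != 0.0) {                              // keep the residual current
            for (int i = 0; i < m; ++i) r[i] += A[i * n + j] * d;
            x[j] = xnew;
        }
    }
}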
Other methods outside the scope of this review include the Principal Block Pivot-
ing method for large sparse NNLS in [7], and the Interior Point Newton-like method
in [1], [16] for moderate and large problems.
2. Active-set Method. Given a set of m linear equations in n unknowns which
are constrained to be non-negative, let the active-set Z be the subset of variables which
violate the non-negativity constraint or are zero, and let the passive-set P be the variables
with positive values. Lawson and Hanson observe that only a small subset of variables
remains in the candidate active-set Z at the solution. If the true active-set Z is known,
then the NNLS problem reduces to an unconstrained least squares problem in the
variables from the passive-set.
Although the method does not compute and store matrix Q̂, it requires both row and
column access to matrix R̂ and more operations to produce column rp̂i. Computing
and correcting for vector z entails four back-substitutions using matrices R̂T, R̂, and
ÂP. Each of the four back-substitutions requires ℓ row or column memory reads from matrix
R̂. Two of the back-substitutions require m row memory reads from matrix
ÂP. In total, the method performs 3m + 2ℓ column and row memory reads
and one column memory write to update matrix R̂. The entire procedure requires
6m² + 4m + 1 + 3ℓ² + 3ℓ flops. The asymptotic complexity is O(m² + n²).
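For concreteness, the following is a minimal host-side sketch of the CSNE column insertion under the reordering strategy, with dense storage and routine names that are our own assumptions. The new column of R̂ is r = R̂⁻ᵀ(ÂPᵀa), the new diagonal entry is the norm of the residual e = a − ÂP R̂⁻¹r, and one correction pass refines both against rounding error; the four triangular solves with R̂ᵀ and R̂ below correspond to the back-substitutions counted above.

// CSNE insertion of column a into A_P = Q_hat R_hat without forming Q_hat (sketch).
#include <vector>
#include <cmath>
typedef std::vector<double> Vec;

// Solve R^T y = c (forward substitution); R is ell x ell upper triangular, row-major.
static Vec solveRT(const double* R, int ell, int ldr, const Vec& c) {
    Vec y(ell);
    for (int i = 0; i < ell; ++i) {
        double s = c[i];
        for (int k = 0; k < i; ++k) s -= R[k * ldr + i] * y[k];
        y[i] = s / R[i * ldr + i];
    }
    return y;
}
// Solve R z = y (back substitution).
static Vec solveR(const double* R, int ell, int ldr, const Vec& y) {
    Vec z(ell);
    for (int i = ell - 1; i >= 0; --i) {
        double s = y[i];
        for (int k = i + 1; k < ell; ++k) s -= R[i * ldr + k] * z[k];
        z[i] = s / R[i * ldr + i];
    }
    return z;
}

// Append column a (length m): returns r (length ell) and rho, the new diagonal entry.
void csneInsert(const double* AP, int m, int ell, int lda,   // A_P is m x ell
                const double* R, int ldr, const double* a, Vec& r, double& rho) {
    Vec c(ell, 0.0), e(a, a + m);
    for (int j = 0; j < ell; ++j)                            // c = A_P^T a
        for (int i = 0; i < m; ++i) c[j] += AP[i * lda + j] * a[i];
    r = solveRT(R, ell, ldr, c);                             // r = R^{-T} A_P^T a
    Vec z = solveR(R, ell, ldr, r);                          // z = R^{-1} r
    for (int i = 0; i < m; ++i)                              // e = a - A_P z
        for (int j = 0; j < ell; ++j) e[i] -= AP[i * lda + j] * z[j];
    Vec c2(ell, 0.0);                                        // one correction pass
    for (int j = 0; j < ell; ++j)
        for (int i = 0; i < m; ++i) c2[j] += AP[i * lda + j] * e[i];
    Vec dr = solveRT(R, ell, ldr, c2);
    Vec dz = solveR(R, ell, ldr, dr);
    for (int j = 0; j < ell; ++j) r[j] += dr[j];             // corrected column of R
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < ell; ++j) e[i] -= AP[i * lda + j] * dz[j];
    rho = 0.0;                                               // new diagonal = ||e||
    for (int i = 0; i < m; ++i) rho += e[i] * e[i];
    rho = std::sqrt(rho);
}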
Without the reordering strategy, the same CSNE method computes column rp̂i,
and a series of rotations then introduces zeros below index i. The rotation transformations
are then applied to the columns to the right of index i in matrix R̂. This requires
an additional ℓ − i row memory reads and ℓ − i row memory writes to matrix R̂ and 6(ℓ − i +
1)(ℓ/2 − i/2 − 1) + 3(ℓ − i) flops. The asymptotic complexity is O(n²). The costs for
the updates are summarized in Table 3.1.
3.4. QR Downdating by Rotations. The reordering strategy is less applica-
ble to the downdating scheme since a deleted column need not be in the right-most position.
Suppose that column ai is removed from matrix AP = [a1 , · · · , ai−1 , ai+1 , · · · , an ]. Let
p̂j be the corresponding column index in the ordered list. We consider the reformula-
tion of (3.1) without column p̂j as
where ÃP = Q̃R̃, matrices ÃP and R̃ are missing column p̂j, and matrix Q̃ = Q̂ is
unchanged. Column qp̂j still exists in matrix Q̃, and matrix R̃ is no longer upper-
triangular since the columns to the right of index p̂j have shifted left.
Observe that the shifted right sub-matrix of R̃ has Hessenberg form. In [11],
a series of Givens rotations introduces zeros along the sub-diagonal. However, this
does not directly address the removal of column p̂j from matrix Q̃. Instead, we apply
a series of Givens rotations to introduce zeros along the j-th row of matrix R̃. The
rotations are applied to the right of column p̂j in the transformation
\tilde{A}_P = (\tilde{Q}\, G_{j+1}^T G_{j+2}^T \cdots G_{i-1}^T G_i^T)(G_i G_{i-1} \cdots G_{j+2} G_{j+1} \tilde{R}),

(3.4) \quad G_i G_{i-1} \cdots G_{j+2} G_{j+1} \tilde{R} =
\begin{bmatrix}
\ddots &      &      &      &      &        \\
       & \ast & \ast & \ast & \ast &        \\
       & 0    & \ast & \ast & \ast &        \\
\cdots & 0    & 0    & 0    & 0    & \cdots \\
       & 0    & 0    & \ast & \ast &        \\
       & 0    & 0    & 0    & \ast &        \\
       &      &      &      &      & \ddots
\end{bmatrix},
which preserves ÃP = Q̃R̃ while introducing zeros along the j-th row of matrix R̃ and
modifying matrix Q̃. This enables both row j of the updated matrix R̃ and column
j of the updated matrix Q̃ to be removed without altering the product ÃP. Vector Q̃T b is
updated via a similar transformation.
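As an illustration, the following sequential sketch applies the rotation scheme above to a dense, physically shifted copy of R̃; unlike our implementation, it does not use the reordered list P̂, and the storage layout and names are our own assumptions.

// Zero row j of R~ after deleting column j (sketch; 0-based indices).
// R is ell x ell upper triangular, row-major; Q is m x ell; qtb holds Q^T b.
#include <vector>
#include <cmath>

void downdateByRotations(std::vector<double>& R, std::vector<double>& Q,
                         std::vector<double>& qtb, int m, int ell, int j) {
    // Physically shift the columns to the right of j one position to the left.
    for (int row = 0; row < ell; ++row)
        for (int col = j; col < ell - 1; ++col) R[row * ell + col] = R[row * ell + col + 1];
    // Rotate row j against rows k = j+1..ell-1, zeroing row j from left to right.
    for (int k = j + 1; k < ell; ++k) {
        int col = k - 1;                             // pivot column of row k after the shift
        double a = R[j * ell + col], b = R[k * ell + col];
        double rad = std::sqrt(a * a + b * b);
        if (rad == 0.0) continue;
        double c = b / rad, s = a / rad;             // chosen to annihilate R[j][col]
        for (int cc = col; cc < ell - 1; ++cc) {     // rows j and k of R~
            double rj = R[j * ell + cc], rk = R[k * ell + cc];
            R[j * ell + cc] = c * rj - s * rk;
            R[k * ell + cc] = s * rj + c * rk;
        }
        for (int i = 0; i < m; ++i) {                // columns j and k of Q
            double qj = Q[i * ell + j], qk = Q[i * ell + k];
            Q[i * ell + j] = c * qj - s * qk;
            Q[i * ell + k] = s * qj + c * qk;
        }
        double vj = qtb[j], vk = qtb[k];             // entries j and k of Q^T b
        qtb[j] = c * vj - s * vk;
        qtb[k] = s * vj + c * vk;
    }
    // Row j of R, column j of Q, and entry j of Q^T b can now be discarded.
}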
One way to map each linear system to a thread is to declare the entire NNLS algorithm
within a parallel OpenMP region. That is, a specified fraction of threads executes
NNLS on mutually exclusive sets of linear systems of equations. The remaining
threads are dedicated to the MKL library in order to accelerate common matrix-
vector and vector-vector operations used to solve the unconstrained least squares
sub-problem.
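A minimal sketch of this mapping is shown below; the function names, signatures, and data layout are illustrative assumptions, and nnls_solve stands in for the single-system active-set solver of Section 2.

// Each OpenMP thread runs the full NNLS solve on its own subset of systems.
#include <omp.h>

void nnls_solve(const double* A, const double* b, double* x, int m, int n) {
    // placeholder body: the single-system active-set iteration goes here
    (void)A; (void)b; (void)x; (void)m; (void)n;
}

void nnls_batch(const double* A,      // shared m x n system matrix
                const double* B,      // numSystems right-hand sides, length m each
                double* X,            // numSystems solutions, length n each
                int m, int n, int numSystems, int nnlsThreads) {
    #pragma omp parallel for num_threads(nnlsThreads) schedule(dynamic)
    for (int s = 0; s < numSystems; ++s)             // mutually exclusive systems
        nnls_solve(A, B + (long)s * m, X + (long)s * n, m, n);
}

The remaining hardware threads are then left for the multi-threaded MKL calls issued inside each solve.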
5. GPU Architectures. Recent advances in general purpose graphics process-
ing units (GPUs) have given rise to highly programmable architectures designed with
parallel applications in mind. Moreover, GPUs are considered to be typical of future
generations of highly parallel, multi-threaded, multi-core processors with tremendous
computational horsepower. They are well-suited for algorithms that map to a Single-
Instruction-Multiple-Thread (SIMT) architecture. Hence, GPUs achieve a high arith-
metic intensity (ratio of arithmetic operations to memory operations) when performing
the same operations across multiple threads on a multi-processor.
GPUs are often designed as a set of multiprocessors, each containing a smaller set
of scalar-processors (SP) with a Single-Instruction-Multiple-Data (SIMD) architec-
ture. Hardware multi-threading under a SIMT architecture maps multiple threads to
a single SP. A single SP handles the instruction address and register states of multi-
ple threads so that they may execute independently. The multiprocessor’s SIMT unit
schedules batches of threads to execute a common instruction. If threads of the same
batch diverge via a data-dependent conditional branch, then all the threads along the
separate branches are serialized until they converge back to the same execution path.
GPUs have a hierarchical memory model with significantly different access times
to each level. At the top, all multiprocessors may access a global memory pool on the
device. This is the common space where input data is generally copied and stored
from main memory by the host. It is also the slowest memory to access, as a single
query from a multi-processor incurs a latency of 400 to 600 clock cycles on a cache miss.
See [20] for a discussion of coalesced global memory accesses, which read or write
a contiguous chunk of memory at the cost of a single query, and of the implicit caching on the
Fermi architecture. On the same level, texture memory is also located on the device
but can only be written to from the host. However, it is faster than global memory
when access patterns are spatially local. On the next level, SPs on the same multi-
processor have access to a fast shared memory space. This enables explicit inter-
thread communication and temporary storage for frequently accessed data. Constant
memory, located on each multi-processor, is cached and optimized for broadcasting
to multiple threads. On the lowest level, an SP has its own private set of registers
distributed amongst its assigned threads. The latency for accessing both shared memory and
per-processor registers normally adds zero extra clock cycles to the instruction time.
Programming models such as NVIDIA’s Compute Unified Device Architecture
(CUDA) [20] and OpenCL [21] organize threads into thread-blocks, which in turn are
arranged in a 2D grid. A thread-block refers to a 1-D or 2-D patch of threads that
are executed on a single multiprocessor. These threads efficiently synchronize their
instructions and pass data via shared memory. Instructions are generally executed in
parallel until a conditional branch or an explicit synchronization barrier is declared.
The synchronization barrier ensures that the thread-block waits for all of its threads to
complete their last instruction. Thus, two levels of data parallelism are achieved. The
threads belonging to the same thread-block execute in lock-step as they process a set
of data. Individual thread-blocks execute asynchronously but generally with the same
set of instructions on a different set of data.
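The following toy kernel, with illustrative names of our own, shows the two levels: blocks operate asynchronously on independent vectors, while the threads within a block stage data through shared memory and wait at a barrier before reusing it.

// Each block reverses one vector of length m (m <= blockDim.x); blocks run
// asynchronously, threads within a block synchronize at the barrier.
__global__ void reverseVectors(float* data, int m) {
    extern __shared__ float tile[];                  // per-block shared memory
    float* v = data + (size_t)blockIdx.x * m;        // this block's vector
    int t = threadIdx.x;
    if (t < m) tile[t] = v[t];                       // stage into shared memory
    __syncthreads();                                 // barrier: all loads visible
    if (t < m) v[t] = tile[m - 1 - t];               // now safe to read any element
}

// Launch: one block per vector, m threads each, m floats of dynamic shared memory.
// reverseVectors<<<numVectors, m, m * sizeof(float)>>>(d_data, m);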
While efficient algorithms on sequential processors must reduce the number of
computations and cache-misses, parallel algorithms on GPUs are more concerned
with minimizing data dependencies and optimizing accesses to the memory hierarchy.
Data dependency increases the number of barrier synchronizations amongst threads
and is often subject to the choice of the algorithm. Memory access patterns present
a difficult bottleneck on multiple levels. While latency is the first concern for smaller
problems, we run into a larger issue with memory availability as the problem size
grows. That is, the shared memory and register availability are hard limits that bound
the size and efficiency of thread-blocks. A register memory bound per SP limits the
number of threads assigned to each SP and so decreases the maximum number of
threads and thread-blocks running per multi-processor. A shared memory bound per
multi-processor limits the number of thread-blocks assigned to a multi-processor and
so decreases the total number of threads processed per multi-processor.
5.1. GPU Implementation. One way to map each linear system onto a GPU
is to consider every thread-block as an independent vector processor. Each thread-
block of size m × 1 maps to the elements in a column vector and asynchronously
solves for a mutually exclusive set of linear systems. The number of thread-blocks
that fit onto a single multi-processor depends on the column size m or the number of
equations in the linear system. This poses a restriction on the size of linear systems
that our GPU implementation can solve as the maximum size m is constrained to a
fraction of the amount of shared memory available per multi-processor. Fortunately,
this is not an issue for applications where m is small (500-1000) and the number of
linear systems to be solved is large. However for arbitrarily sized linear systems of
equations, our GPU implementation is not generalizable. We note that this is not
an algorithmic constraint but rather a design choice for our application. Our multi-
core CPU implementation of the same algorithm can solve arbitrarily sized linear
systems. We discuss the details of the GPU implementation in sections 5.2-5.3.
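A sketch of the corresponding launch configuration is shown below; the kernel name, argument list, and shared-memory sizing are assumptions for illustration, not our actual interface.

// One thread-block per linear system, one thread per equation (m must fit within
// the block-size and shared-memory limits of a multi-processor).
__global__ void nnlsKernel(const float* A, const float* B, float* X, int m, int n) {
    // placeholder body: per-block active-set iteration of sections 5.2-5.3 goes here
}

void launchBatchedNNLS(const float* dA, const float* dB, float* dX,
                       int m, int n, int numSystems) {
    dim3 grid(numSystems);                  // independent systems
    dim3 block(m);                          // one thread per vector element
    size_t shmem = m * sizeof(float);       // scratch for reductions and the list P_hat
    nnlsKernel<<<grid, block, shmem>>>(dA, dB, dX, m, n);
}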
5.2. Parallelizing QR Methods. Full QR decompositions on the GPU via
blocked MGS, Givens rotations, and Householder reflections are implemented in [23],
[15]. While [15] notes that the blocked MGS and Givens rotation methods are ill-
suited for large systems on GPUs, as they suffer from instability and synchronization
overhead, we are interested in only the QR updating and downdating schemes for a
large number of small systems. We show that m-threaded thread-blocks on the multi-
processors can efficiently perform the MGS updating and Givens rotation downdating
steps.
For the MGS update step, most of the operations are formulated as vector inner
products, scalar-vector products, and vector-vector summations. These operations
lead to a one-to-one mapping between the m × 1 column vector coordinates and
the m-threaded thread-block. Such operations are computable via parallel reduction
techniques from [4]. In algorithm 2, we parallelize all four inner products (lines 3, 4,
6, 8) in log m parallel time each. The inner loop iterates ℓ or at most n times.
Thus, we obtain an order reduction in parallel time-complexity to O(n log m).
For the Givens rotation downdate step, we obtain a one-to-many mapping be-
tween the n × 1 row vector elements of matrix R and the m threads in a thread-block,
and a one-to-one mapping for the m × 1 column vector elements in matrix
Q. Computing vector QT b follows a similar relation. For the rotation coef-
ficients c, s, a single thread computes them and broadcasts them to the rest of the thread-block.
In algorithm 3, the inner loop (lines 3-4) updates both matrices R̃ and Q̃ in paral-
lel O(1) time. Writing row and column data and updating vector Q̃T b (line 7) are
thread-independent and computable in O(1) parallel time. Thus, we obtain an order
reduction in parallel time-complexity to O(n).
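The sketch below shows one such parallel transformation from the downdate step: a single thread computes and broadcasts (c, s) through shared memory, after which each thread updates its own element of the two affected rows of R̃, columns of Q̃, and entries of Q̃T b. The matrix layouts and names are illustrative assumptions rather than our exact kernel.

// Apply one Givens rotation of the downdate in parallel within a thread-block.
// R is ell x ell row-major, Q is m x ell row-major, qtb has length ell. The
// rotation pairs rows j and k of R (columns j and k of Q) and zeroes R[j][col],
// where col = k - 1 is the pivot column of row k after the column shift.
__global__ void givensDowndateStep(float* R, float* Q, float* qtb,
                                   int m, int ell, int j, int k) {
    __shared__ float c, s;
    int t = threadIdx.x;
    int col = k - 1;
    if (t == 0) {                            // compute and broadcast (c, s)
        float a = R[j * ell + col], b = R[k * ell + col];
        float rad = sqrtf(a * a + b * b);
        c = (rad > 0.0f) ? b / rad : 1.0f;
        s = (rad > 0.0f) ? a / rad : 0.0f;
    }
    __syncthreads();
    if (t >= col && t < ell) {               // one-to-many: rows j and k of R
        float rj = R[j * ell + t], rk = R[k * ell + t];
        R[j * ell + t] = c * rj - s * rk;
        R[k * ell + t] = s * rj + c * rk;
    }
    if (t < m) {                             // one-to-one: columns j and k of Q
        float qj = Q[t * ell + j], qk = Q[t * ell + k];
        Q[t * ell + j] = c * qj - s * qk;
        Q[t * ell + k] = s * qj + c * qk;
    }
    if (t == 0) {                            // entries j and k of Q^T b
        float vj = qtb[j], vk = qtb[k];
        qtb[j] = c * vj - s * vk;
        qtb[k] = s * vj + c * vk;
    }
}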
Parallel reductions are often performed on the GPU in place of common vector-
vector operations, using the prefix-sum techniques discussed in [13]. Algorithm 4 sums 512 elements
in 9 parallel flops, 5 thread-synchronizations, and 18 parallel shared memory accesses.
Each of the 512 threads reserves a memory slot in shared memory. The unique thread
ID or tID denotes the corresponding data index in the shared memory array. At
each step, half the threads from the previous step sum up the data entries stored in
the other half of shared memory. The process continues until index 0 in the shared
memory array contains the total summation.
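The following device function is a minimal sketch of such a shared-memory tree reduction for a 512-thread block (the name is ours); each round halves the number of active threads until index 0 holds the sum. For simplicity it places a barrier after every round, whereas Algorithm 4 reaches only 5 synchronizations, presumably by exploiting warp-level lock-step execution in the final rounds.

// Tree reduction over 512 floats already staged in shared memory (sketch).
// After the call, sdata[0] holds the sum of the original 512 entries.
__device__ void blockReduceSum512(float* sdata, int tID) {
    // 9 halving steps for 512 elements; barrier after each so that all partial
    // sums are visible before the next step reads them.
    for (int stride = 256; stride > 0; stride >>= 1) {
        if (tID < stride) sdata[tID] += sdata[tID + stride];
        __syncthreads();
    }
}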
5.3. Memory Usage. To take advantage of different access times on the GPU
memory hierarchy, the input and intermediate data can be stored and accessed on
different levels for efficient reuse. Local intermediate vectors can be stored either in
shared memory or in dedicated registers spanning all threads in a thread-
block. List P̂ is stored in shared memory as multiple threads require synchronization
to update and downdate the same column. The right-hand-side vector Q̂T b is stored
in registers since no thread accesses elements outside its one-to-one mapping in the
update and downdate steps.
Global memory accesses on the GPU are unavoidable for updating the large matrices
Q̂ and R̂. We store matrix Q̂T so that column vector accesses are coalesced in row-
oriented programming models, and we store matrix R̂ directly since the Givens rotations update its rows.
Matrices Q̂ and R̂ are stored in-place, unlike the compact format in (3.1), (3.3). We
allocate m × n blocks of global memory and use the reordered list P̂ to associate
column and row indices for the update and downdate steps. This avoids any
physical shifts of column vectors in global memory; instead, we shift the list
P̂ in parallel when a variable is removed from the passive-set.
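A minimal sketch of this parallel shift is shown below (the names are ours): each thread first reads its successor entry, the block synchronizes, and the tail of the list is then written back one position to the left.

// Remove position pos from the shared-memory index list P_hat of length ell
// by shifting the tail left in parallel (read, barrier, then write).
__device__ void removeFromPassiveList(int* Phat, int ell, int pos) {
    int t = threadIdx.x;
    int moved = 0;
    if (t >= pos && t < ell - 1) moved = Phat[t + 1];   // read before any write
    __syncthreads();
    if (t >= pos && t < ell - 1) Phat[t] = moved;       // shift left by one
    __syncthreads();
}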
The MGS update step reads ℓ columns of matrix Q̂ from global memory
into registers. Computing inner products and vector norms during the projections
requires an intermediate shared memory vector for the parallel reduction function.
The new column for matrix R̂ is stored locally in registers before being written to global
memory. A single element of vector Q̂T b is updated and written to shared memory.
The total number of parallel shared memory accesses is 39ℓ + 2. The total number of
parallel global memory accesses is ℓ + 2.
The Givens rotations downdate step accesses two columns of matrix Q̃ and two
rows of matrix R̃ for each of the ℓ − i transformations. Since row i of matrix R̃ and
column i of matrix Q̃ are fixed across transformations, they are stored and updated
in shared memory. The other row and column are directly updated in global memory.
Updating vector Q̂T b requires two shared memory reads and writes. The total number
of parallel shared memory accesses is 2(ℓ − i) + 2. The total number of parallel global
memory accesses is 2(ℓ − i + 1).
where t is the sample’s time, y(t) the observed signal, and n the number of samples
over time. To solve for the unknown signal x, we rewrite (6.1) as the following linear
13
Ax = b.
Efficient algorithms for the deconvolution problem, which either exploit the simple
structure of the convolution in Fourier space, or which exploit the Toeplitz structure
of the matrix, are available in [5], [8]. However, if signal x is known to be non-negative
and the data y(t) is corrupted by noise, then we may treat the deconvolution as a
NNLS problem to ensure non-negativity.
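As an illustration of this reformulation, the sketch below builds the banded Toeplitz convolution matrix A from a kernel s so that Ax approximates the discrete convolution s * x; the layout and names are our own assumptions.

// Build the n x n convolution (Toeplitz) matrix with A[i][j] = s[i - j], so that
// (A x)(i) = sum_j s[i - j] x[j] matches the causal discrete convolution.
#include <vector>

std::vector<double> buildConvolutionMatrix(const std::vector<double>& s, int n) {
    std::vector<double> A((size_t)n * n, 0.0);          // row-major, dense
    int klen = (int)s.size();                           // kernel support width
    for (int i = 0; i < n; ++i)
        for (int j = 0; j <= i; ++j)                    // lower-triangular band
            if (i - j < klen) A[(size_t)i * n + j] = s[i - j];
    return A;
}

// The NNLS solve min ||Ax - b|| subject to x >= 0 then recovers a non-negative
// estimate of x from the noisy observations b.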
7. Experiments. As a baseline, we note that Matlab’s lsqnonneg function im-
plements the same active-set algorithm but with a full QR decomposition for the
least squares sub-problem. Matlab 2009b and later versions use Intel’s Math Kernel
Library (MKL) with multi-threading to resolve the least squares sub-problems. For
a better comparison, we first port the lsqnonneg function into native C-code with
calls to multi-threaded MKL BLAS and LAPACK functions. The results from this
implementation (CPU lsqnonneg) show a 1.5-3x speed-up over the Matlab lsqnonneg
function in our experiments. Next, we apply our updating and downdating strategies
with column reordering using MKL BLAS functions to the CPU version. The results
from this second implementation (CPU NNLS) show a 1-3x speed-up that depends
on the number of column updates and downdates. Last, we compare the lsqnonneg
variants to alternative NNLS algorithms from the literature.
To compare the GPU implementation with the multi-threaded CPU variants, we time
from the point of entry to the point of exit of the GPU kernel function. Memory
transfer and pre-processing times in the case of non-synthetic data are omitted. Both
the GPU and CPU variants obtain identical solutions, subject to rounding er-
ror, within the same number of iterations for all data sets. We find that for a small
number of linear systems, the CPU implementations outperform the GPU, as only a
fraction of the GPU cores are utilized. When the number of linear systems surpasses
the number of multi-processors, the GPU scales better, by a factor of 1-3x over our
fastest CPU implementation.
For reference, we use a Dual Quad-Core Intel(R) Xeon(R) X5560 CPU @ 2.80GHz
(8 cores) for testing our CPU implementations. The CPU codes compiled under
Intel icc 11.1 and gcc 4.5.1, and linked to MKL 10.1.2.024, yield similar results for 8
run-time threads. The codes tested under Matlab 2010b and 2009b also yield
comparable results. Mixing the number of threads assigned between OpenMP and
MKL did not have a large impact on our system. We use an NVIDIA Tesla C2050 (448
cores across 14 multi-processors) and codes compiled under CUDA 3.2 for the GPU
implementation and testing.
7.1. Non-Synthetic Test Data. For real-world data, we use terrain laser imag-
ing sets obtained from the NASA Laser Vegetation Imaging Sensor (LVIS)1 . Each
data set contains multiple 1-D Gaussian-like signals s and observations of total return
energy b of size m = n = 432. In this deconvolution problem, the transfer functions
s represent the single impulse energy fired over time at ground terrain, and the ob-
served signals b form waveforms that indicate the reflected energy over time. The
signals s are generally 15-25 samples wide, so the computed matrices A are Toeplitz,
banded, and sparse. NNLS solves for corresponding pairs of matrix A and vector b
to obtain the sparse non-negative solutions x that represent the times of arrival for a
series of fired impulses. This estimates the ranges or distances to a surface target.
For comparing the NNLS methods, we record the run-times in relation to the
number of column updates and downdates for the least squares sub-problem. The
results from CPU NNLS show an 11x speed-up over the GPU implementation when
solving for a single system. This is due to the underutilization of cores in all but
the multi-processor currently assigned to the linear system of interest. For a larger
number of systems, the GPU results show a 1-2x speed-up over CPU NNLS due in part
to the larger number of processing units suited for vector operations in the algorithm.
The results between CPU NNLS and CPU lsqnonneg show the performance gains
from fewer flops and memory accesses attained by the column reordering, updating,
and downdating strategies.
7.2. Synthetic Test Data. For the first set of synthetic data, we generate
mean-shifted 1-D Gaussians with σ = 4.32 to store as columns in matrix A of size
m = n = 512. In this Gaussian fitting problem, each system uses the same matrix
A but with non-negative random vectors b. The choice of the σ parameter ensures
that the mean-shifted Gaussians are neither so wide as to allow early convergence
nor so narrow as to locally affect only a few variables. Furthermore, matrix A is
now considered dense and vectors b no longer reflect real-world values. We expect the
average number of iterations or column updates and downdates to exceed that of the
real-world data cases.
The total speed-up of GPU NNLS over the CPU variants is more pronounced
(3x compared to CPU NNLS, 23x compared to CPU lsqnonneg). The larger ratio
of column downdate to update steps suggests that our reordering strategy and fast
Givens rotation method in the downdating step outperform the lsqnonneg variants.
1 https://lvis.gsfc.nasa.gov/index.php
For the second set of synthetic data, we generate both random matrices A of size
m = n = 512 and non-negative random vectors b. The number of column updates
and downdates is less than that of the two previous experiments. Furthermore, the
total number of column updates dominates the number of column downdates. The
results between GPU and CPU NNLS show that both implementations have similar
run-time scaling as the number of systems increases. This suggests that most of
the performance gains in the prior experiments come from the GPU downdating steps. The
results between CPU NNLS and CPU lsqnonneg lead to a similar conclusion, as the
performance gain (1.7x) is minimal compared to the prior two experiments.
//www.cs.umd.edu/~yluo1/Projects/NNLS.html.
9. Acknowledgements. We would like to acknowledge James Blair and Michelle
Hofton at NASA for providing us with data for the deconvolution problem, NVIDIA
and NSF award 0403313 for facilities, and ONR award N00014-08-10638 for support.
REFERENCES
[1] S. Bellavia, M. Macconi, and B. Morini, An interior point Newton-like method for non-
negative least squares problems with degenerate solution, Numerical Linear Algebra with
Applications, 13 (2006), pp. 825–846.
[2] M. H. van Benthem and M. R. Keenan, Fast algorithm for the solution of large-scale non-
negativity-constrained least squares problems. Journal of Chemometrics, Vol. 18, (2004),
pp. 441–450.
[3] A. Björck, Stability Analysis of the method of semi-normal equations for least squares prob-
lems, Linear Algebra Appl., 88/89 (1987), pp. 31–48.
[4] G. E. Blelloch, Prefix Sums and Their Applications, Technical Report CMU-CS-90-190,
Carnegie Mellon University, Pittsburgh, PA, Nov. (1990).
[5] A. W. Bojanczyk, R. P. Brent, and F. R. de Hoog, QR factorization of Toeplitz matrices,
Numerische Mathematik, 49 (1986), pp. 81–94.
[6] R. Bro and S. de Jong, A fast non-negativity-constrained least squares algorithm, Journal of
Chemometrics, Vol. 11, No. 5, (1997), pp. 393–401.
[7] M. Catral, L. Han, M. Neumann, and R. Plemmons, On reduced rank nonnegative matrix
factorization for symmetric nonnegative matrices, Lin. Alg. Appl., 393 (2004), pp. 107–126.
[8] R. H. Chan, J. G. Nagy, and R. J. Plemmons, FFT-based preconditioners for Toeplitz-block
least squares problems, SIAM J. Numer. Anal., 30 (1993), pp. 1740–1768.
[9] D. Chen, R. J. Plemmons, Nonnegativity Constraints in Numerical Analysis, Symp on the
Birth of Numerical Analysis, Leuven, Belgium, (2007).
[10] V. Franc, V. Hlavc, and M. Navara, Sequential coordinate-wise algorithm for non-negative
least squares problem, Research report CTU-CMP-2005-06, Center for Machine Perception,
Czech Technical University, Prague, Czech Republic, (2005).
[11] G. H. Golub and C. F. Van Loan, Matrix Computations, Third ed., The Johns Hopkins
University Press, Baltimore, MD, 1996, pp. 223–236.
[12] S. Hammarling, C. Lucas, Updating the QR factorization and the least squares problem,
MIMS EPrint, Manchester Institute for Mathematical Sciences, University of Manchester,
Manchester, (2008).
[13] M. Harris, Optimizing parallel reduction in CUDA, (2007).
[14] Intel, Math kernel library reference manual, (2010).
[15] A. Kerr, D. Campbell, and M. Richards, QR decomposition on GPUs, GPGPU-2: Pro-
ceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units.
New York, NY, USA: ACM, (2009), pp. 71–78.
[16] S. J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, An interior-point method for
large-scale l1-regularized least squares, IEEE Journal of Selected Topics in Signal Process-
ing, 1 (2007), no. 4, pp. 606–617.
[17] D. Kim, S. Sra, and I. S. Dhillon, A new projected quasi-Newton Approach for the non-
negative least squares problem, Technical Report TR-06-54, Computer Sciences, The Uni-
versity of Texas at Austin, (2006).
[18] H. W. Kuhn and A. W. Tucker, Nonlinear programming, Proceedings of the Second Berkeley
Symposium on Mathematical Statistics and Probability, University of California Press,
Berkeley, CA, (1951), pp. 481–492.
[19] C. L. Lawson and R. J. Hanson, Solving Least Squares Problems, Prentice-Hall, 1974.
[20] NVIDIA, CUDA programming guide 3.2, (2011).
[21] NVIDIA, OpenCL programming guide for CUDA architectures 3.1, (2010).
[22] Sun Microsystems, Inc., OpenMP API user guide, (2003).
[23] V. Volkov and J. W. Demmel, Benchmarking GPUs to tune dense linear algebra, SC08,
(2008).