Cost Matrix
Let $y_{nj}$ be the data value at position (genomic coordinate) $n = 1, \ldots, N$ for replicate array $j = 1, \ldots, J$. Hence we have $J$ arrays and sequences of length $N$. The goal of this note is to describe an $O(NJ)$ algorithm to calculate the cost matrix of a piecewise linear model for the segmentation of the $(1, \ldots, N)$ axis. It is implemented in the function costMatrix in the package tilingArray. The cost matrix is the input for a dynamic programming algorithm that finds the optimal (least squares) segmentation.

The cost matrix $G_{km}$ is the sum of squared residuals for a segment from $m$ to $m + k - 1$ (i.e. including $m + k - 1$ but excluding $m + k$),
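\[
G_{km} := \sum_{j=1}^{J} \sum_{n=m}^{m+k-1} \left( y_{nj} - \hat{\mu}_{km} \right)^2, \tag{1}
\]
where $\hat{\mu}_{km}$ is the mean of that segment,
\[
\hat{\mu}_{km} = \frac{1}{Jk} \sum_{j=1}^{J} \sum_{n=m}^{m+k-1} y_{nj}. \tag{2}
\]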
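As a reference point, definitions (1) and (2) can be evaluated directly for a single segment. The following is a minimal sketch, not the code from tilingArray; the function name segmentCost and the toy data are assumptions made for illustration. Each call touches all $Jk$ values of the segment, which is the work that the algebra in the next section avoids.

```r
# Direct evaluation of definitions (1) and (2) for one segment.
# segmentCost is a hypothetical helper, not part of tilingArray.
# y: numeric N x J matrix; m: first position; k: segment length.
segmentCost <- function(y, m, k) {
  seg <- y[m:(m + k - 1), , drop = FALSE]  # rows m, ..., m + k - 1 of all J arrays
  muHat <- mean(seg)                       # equation (2): mean over the J * k values
  sum((seg - muHat)^2)                     # equation (1): sum of squared residuals
}

# Toy data (assumed): N = 6 positions, J = 2 arrays, level shift after position 3.
y <- cbind(c(1, 2, 3, 10, 11, 12),
           c(2, 3, 4, 11, 12, 13))

segmentCost(y, m = 1, k = 3)  # 5.5: positions 1..3 lie on one level
segmentCost(y, m = 1, k = 6)  # 254: a single segment spanning the level shift
```

The much larger cost of the single long segment is what leads a least squares segmentation to place a change point between positions 3 and 4.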
Sidenote: a perhaps more straightforward definition of a cost matrix would be $\bar{G}_{m'm} = G_{(m'-m)\,m}$, the sum of squared residuals for a segment from $m$ to $m' - 1$. I use version (1) because it makes it easier to use the condition of maximum segment length ($k \le k_{\max}$), which I need to get the algorithm from complexity $O(N^2)$ to $O(N)$.
Algebra
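\begin{align}
G_{km} &= \sum_{j=1}^{J} \sum_{n=m}^{m+k-1} \left( y_{nj} - \hat{\mu}_{km} \right)^2 \tag{3} \\
&= \sum_{n,j} y_{nj}^2 - \frac{1}{Jk} \Bigl( \sum_{n',j'} y_{n'j'} \Bigr)^2 \tag{4} \\
&= \sum_{n} q_n - \frac{1}{Jk} \Bigl( \sum_{n} r_n \Bigr)^2 \tag{5}
\end{align}
with
\begin{align}
q_n &:= \sum_{j} y_{nj}^2, \tag{6} \\
r_n &:= \sum_{j} y_{nj}. \tag{7}
\end{align}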
If y is an $N \times J$ matrix, then the $N$-vectors q and r can be obtained by

    q = rowSums(y*y)
    r = rowSums(y)

Now define
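\begin{align}
c_{\nu} &= \sum_{n=1}^{\nu} r_n, \tag{8} \\
d_{\nu} &= \sum_{n=1}^{\nu} q_n, \tag{9}
\end{align}
with $c_0 = d_0 = 0$. Then (5) becomes
\[
G_{km} = (d_{m+k-1} - d_{m-1}) - \frac{1}{Jk} \, (c_{m+k-1} - c_{m-1})^2. \tag{10}
\]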
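The cumulative-sum form (10) can be sketched in a few lines of R. This is an illustration under stated assumptions, not the costMatrix implementation from tilingArray: the function name costMatrixSketch, the layout of the result (rows indexed by $k$, columns by $m$, NA where a segment would run past position $N$) and the toy data are all choices made here. After an $O(NJ)$ pass to compute q, r and their cumulative sums, each entry costs a constant number of operations.

```r
# Sketch of equation (10). costMatrixSketch is a hypothetical name and is not
# the costMatrix function from tilingArray.
# y: numeric N x J matrix; kmax: maximum segment length (1 <= kmax <= N).
# Returns a kmax x N matrix G with G[k, m] = cost of the segment m, ..., m + k - 1,
# and NA where that segment would extend beyond position N.
costMatrixSketch <- function(y, kmax = nrow(y)) {
  N <- nrow(y)
  J <- ncol(y)
  stopifnot(kmax >= 1, kmax <= N)
  q <- rowSums(y * y)     # equation (6)
  r <- rowSums(y)         # equation (7)
  cc <- c(0, cumsum(r))   # cc[nu + 1] holds c_nu, with c_0 = 0 (equation (8))
  dd <- c(0, cumsum(q))   # dd[nu + 1] holds d_nu, with d_0 = 0 (equation (9))
  G <- matrix(NA_real_, nrow = kmax, ncol = N)
  for (k in seq_len(kmax)) {
    m <- seq_len(N - k + 1)             # all valid start positions for length k
    sumQ <- dd[m + k] - dd[m]           # d_{m+k-1} - d_{m-1}
    sumR <- cc[m + k] - cc[m]           # c_{m+k-1} - c_{m-1}
    G[k, m] <- sumQ - sumR^2 / (J * k)  # equation (10)
  }
  G
}

# Toy data (assumed): N = 6 positions, J = 2 arrays, level shift after position 3.
y <- cbind(c(1, 2, 3, 10, 11, 12),
           c(2, 3, 4, 11, 12, 13))
G <- costMatrixSketch(y, kmax = 3)
G[3, 1]  # 5.5: cost of the segment covering positions 1..3
G[3, 5]  # NA: a segment of length 3 starting at position 5 would pass N = 6
```

Because (10) subtracts two cumulative sums of similar size, entries for nearly constant segments may come out as tiny negative numbers in floating point; whether that matters depends on how the downstream dynamic programming step uses the values.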