A Note On Entrywise Consistency For Mixed-Data Matrix Completion
A Note On Entrywise Consistency For Mixed-Data Matrix Completion
A Note On Entrywise Consistency For Mixed-Data Matrix Completion
Abstract
This note studies matrix completion for a partially observed n by p data matrix involving mixed
types of variables (e.g., continuous, binary, ordinal). A general family of non-linear factor models
is considered, under which the matrix completion problem becomes the estimation of an n by p
low-rank matrix M. For existing methods in the literature, estimation consistency is established by
√
showing kM̂ − M∗ kF / np, the scaled Frobenius norm of the difference between the estimated
and true M matrices, converges to zero in probability as n and p grow to infinity. However, this
notion of consistency does not guarantee the convergence of each individual entry and, thus, may
not be sufficient when specific data entries or the worst-case scenario is of interest. To address this
issue, we consider the notion of entrywise consistency based on kM̂ − M∗ kmax , the max norm
of the estimation error matrix. We propose refinement procedures that turn estimators, which are
consistent in the Frobenius norm sense, into entrywise estimators through a one-step refinement.
Tight probabilistic error bounds are derived for the proposed estimators. The proposed methods are
evaluated by simulation studies and real-data applications for collaborative filtering and large-scale
educational assessment.
Keywords: Matrix completion; generalized latent factor model; mixed data; entrywise consis-
tency; max norm
1. Introduction
Missing data are commonly encountered in machine learning, especially for large-scale data involv-
ing many observations and variables. Matrix completion concerns the prediction of missing entries
in a partially observed matrix, which has received wide applications, such as collaborative filtering
(Goldberg et al., 1992; Feuerverger et al., 2012), social network recovery (Jayasumana et al., 2019),
sensor localization (Biswas et al., 2006), and educational and psychological measurement (Bergner
et al., 2022; Chen et al., 2023).
Many matrix completion methods consider real-valued matrices (Candès and Recht, 2009; Candès
and Tao, 2010; Keshavan et al., 2010; Klopp, 2014; Koltchinskii et al., 2011; Negahban and Wain-
wright, 2012; Chen et al., 2020c; Xia and Yuan, 2021). Their theoretical guarantees are typically
established under a linear factor model (e.g. Bartholomew et al., 2008), which says the underlying
complete data matrix can be decomposed as the sum of a low-rank signal matrix M and a mean-
zero noise matrix. Under this statistical model, the matrix completion task becomes to estimate the
signal matrix M based on the observed data entries. However, many real applications of matrix
completion involve mixed types of variables (e.g., continuous, count, binary, ordinal), for which
the linear factor model may not be suitable. For example, in survey studies, different questionnaire
items may be of different measurement scales – some items may be binary (e.g., yes/no), some may
be ordinal (e.g., disagree/neutral/agree), while others may be count variables (e.g., the number of
times that one skipped school). Mixed data also appear in multimodal biomedical data, where dif-
ferent types of variables are collected with different technologies (e.g., gene expression, genotype,
protein activity). Methods have been developed for matrix completion with specific variable types,
such as binary (Cai and Zhou, 2013; Davenport et al., 2014; Han et al., 2020, 2023), categorical
(Bhaskar, 2016; Klopp et al., 2015), count (Cao and Xie, 2015; McRae and Davenport, 2021; Robin
et al., 2019), and mixed data (Robin et al., 2020). Non-linear factor models, which are extensions
of the linear factor model, are typically assumed in these works.
A matrix completion Pmethod is typically evaluated by a mean squared error (MSE), defined as
n Pp
kM̂ − M kF /(np) = i=1 j=1 (m̂ij − m∗ij )2 /(np), where k·kF denotes the matrix Frobenius
∗ 2
norm, n×p is the size of the data matrix, and M̂ = (m̂ij )n×p and M∗ = (m∗ij )n×p are the estimated
and true signal matrices, respectively. Probabilistic error bounds have been established for the MSE
in the literature (see Chen et al., 2020c; Chen and Li, 2022; Cai and Zhou, 2016, and references
therein). Under suitable conditions, these error bounds imply that the MSE decays to zero when
both n and p grow to infinity, which is viewed as a notion of statistical consistency for matrix
completion. However, this notion of consistency slightly differs from that in our traditional sense;
that is, the MSE converging to zero does not imply the convergence of each individual entry, which,
however, may be important in some applications which concern the prediction of individual data
entries. Entrywise results for matrix completion have been established under linear factor models
(Abbe et al., 2020; Chen et al., 2019b, 2020c; Chernozhukov et al., 2023). However, such results
are not available for non-linear factor models, and extending these entrywise results to non-linear
factor models is non-trivial.
This note considers a general matrix completion problem that allows the variables to be of
mixed types. The generalized latent factor model (GLFM; Bartholomew et al., 2008; Skrondal
and Rabe-Hesketh, 2004) is a general family of latent variable models that combine factor analysis
with generalized linear modelling. By allowing for variable-specific link functions, the GLFM is
suitable for modelling multivariate data with mixed types. Under the GLFM framework, we propose
two methods that ensure entrywise consistency under dense and sparse missingness settings. Both
methods apply to an initial estimate whose MSE converges to zero. They obtain refined estimates by
solving some estimating equations constructed based on the initial estimate. The difference between
the two methods is that one involves data splitting while the other does not. The two methods have
the same asymptotic behavior under a dense setting where the proportion of observed entries does
not decay to zero. In that case, their entrywise error rate matches the MSE of the initial estimate
up to a logarithm factor, suggesting that there is virtually no loss when performing refinement.
However, under a sparse setting where the proportion of observed entries converges to zero, the
procedure with data splitting achieves a smaller error rate than the one without data splitting, and
the error rate of the data splitting procedure matches the MSE of the initial estimate up to a logarithm
factor. To our best knowledge, the current work is the first one obtaining an entrywise consistent
estimator for counts and binary data, assuming that the counts and binary data follow the Poisson
factor and the multidimensional two-parameter logistic model, respectively. Moreover, it is also
2
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
the first one for the more general GLFM model for mixed data. Our theoretical analysis further
shows that the refined estimator based on a constrained joint maximum likelihood estimator (Chen
et al., 2020a) for the GLFM is minimax optimal in an entrywise sense under a suitable asymptotic
regime. The proposed methods are evaluated by simulation studies and real-data applications for
collaborative filtering and large-scale educational assessment.
The rest of the note is organized as follows. In Section 2, we introduce a generalized latent factor
model for matrix completion with mixed data. Section 3 introduces two methods for achieving en-
trywise consistency. Theoretical guarantees on the proposed methods are established in Section 4. A
simulation study is given in Section 5, and two real data examples are given in Section 6. Finally, we
conclude with some discussions in Section 7. Additional simulation results and theoretical results,
and proofs of the theorems are given in the appendix. The computation code used in Sections 5 and 6
can be found at https://github.com/yunxiaochen/MatrixCompletion_MixedData.
Assumption 1. The missing indicators, ωij , i ∈ [n], j ∈ [p], are jointly independent. In addition,
Ω and Y are independent.
3
C HEN AND L I
Example 1. For a continuous variable j, we may assume fj to be a normal density function, where
φj is the variance, bj (mij ) = m2ij /2 and cj (yij , φj ) = −yij
2 /(2φ ) − (log(2πφ ))/2. When all the
j j
variables follow this normal model, the data matrix follows a linear factor model.
Example 2. Consider a binary or ordinal variable j such that Yij in {0, 1, ..., kj } for some given
kj ≥ 1, where kj = 1 and kj > 1 correspond to binary and ordinal variables, respectively. We can
assume fj to follow a Binomial logistic model, for which φj = 1, bj (mij ) = kj log(1 + exp(mij ))
and cj (yij , φj ) = log(kj ! ) − log(yij ! ) − log((kj − yij )! ). This model has been considered in
Masters and Wright (1984) with psychometric applications. When all the variables are binary
and follow this logistic model, the data matrix is said to follow a multidimensional two-parameter
logistic (M2PL) item response theory model (Reckase, 2009). This model has been considered in
Davenport et al. (2014) and Cai and Zhou (2013) for the completion of binary matrices.
Example 3. A Poisson model may be assumed for count variables j, for which φj = 1, bj (mij ) =
exp(mij ) and cj (yij , φj ) = − log(yij ! ). When all the variables follow this Poisson model, the joint
model for the data matrix is known as a Poisson factor model (Wedel et al., 2003). This Poisson
model has been considered in Robin et al. (2019) and Robin et al. (2020) for count data with missing
values.
Under the GLFM, EY = (b0j (mij ))n×p , where b0j (·) denotes the derivative of the known func-
tion bj (·). Thus, matrix completion under the GLFM again boils down to estimating the signal
matrix M = ΘAT . This estimation problem will be investigated in the rest. We note that a similar
GLFM framework has been considered in Robin et al. (2020) for analyzing mixed data with missing
values. However, they focused on evaluating the estimation accuracy by the MSE, while our main
focus is the entrywise loss.
4
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
Step 4. For each j ∈ [p], obtain ãj by solving the following equation:
n
X
ωij {yij − b0j ((ãj )T θ̃i )}θ̃i = 0r . (2)
i=1
Output: M̃ = Θ̃(Ã)T , where Θ̃ = (θ̃1 , · · · , θ̃n )T ∈ Rn×r and à = (ã1 , · · · , ãp )T ∈ Rp×r
are obtained from Steps 3 and 4, respectively.
We comment on the implementation. First, the constant C2 depends on the true signal matrix
M∗ .Recall that we assume M∗ to be of rank r under the GLFM. Thus, M∗ can be decomposed
as M = U∗r D∗r (Vr∗ )T , where U∗r ∈ Rn×r and Vr∗ ∈ Rp×r are the left and right singular matrices
∗
5
C HEN AND L I
corresponding to the non-zero singular values, and D∗r ∈ Rr×r is a diagonal matrix whose diagonal
elements are the singular values σ1 (M∗ ) ≥ · · · ≥ σr (M∗ ) > 0. We require C2 to satisfy C2 ≥
kVr∗ k2→∞ . On the other hand, C2 should not be chosen too large. As will be shown in Section 4.2,
it is assumed that C2 has the same asymptotic order as kVr∗ k2→∞ ; otherwise, the error bound for
kM̃ − M∗ kmax needs additional modification. Second, we note that the projection in Step 2 is very
easy to perform. Let V = (v1 , ..., vp )T be a p × r matrix. Then proj{A∈Rp×r :kAk2→∞ ≤C2 } (V) =
(ṽ1 , ..., ṽp )T , where ṽi = vi if kvi k≤ C2 and ṽi = (C2 /kvi k)vi otherwise. Third, the algorithm
requires knowing the number of factors r. Under the GLFM and suitable conditions, this quantity
can be consistently selected based on information criteria (Chen and Li, 2022) or by identifying a
singular value gap using a SVD-based approach (Zhang et al., 2020). Finally, we provide a remark
on solving the equations in Steps 3 and 4.
Remark 1. In Steps 3 and 4, we propose to solve some estimating equations. As will be shown
in Section 4, these equations have a unique solution with probability converging to 1 under a
suitable asymptotic regime. ThesePsteps are equivalent to performing optimization to certain log-
likelihood functions. Let `(M) = i,j:ωij =1 {yij mij − bj (mij )} be a weighted log-likelihood func-
tion based on observed data (Y ◦ Ω, Ω), where the individual log-likelihood terms are weighted
by the dispersion parameters1 . Then, solving the estimating equations (1) is equivalent to solv-
ing Θ̃ ∈ arg maxΘ `(ΘÂT ), and solving the estimating equations (2) is equivalent to solving
à ∈ arg maxA `(Θ̃AT ). This is due to that the estimating equations (1) and (2) are obtained
by taking the partial derivatives of `(ΘAT )with respect to Θ and A, respectively, and that the
objective function `(ΘAT ) is convex with respect to Θ and A given the other.
We provide an informal theorem under a simplified setting to shed some light on the asymptotic
behavior of Algorithm 1. Its formal version is Theorem 5 in Section 4.2, which is established under
a more general setting. For the missing pattern Ω = (ωij )i∈[n],j∈[p] , let πij = P(ωij = 1) be the
sampling probabilities and πmin = mini∈[n],j∈[p] πij and πmax = maxi∈[n],j∈[p] πij be the minimal
and maximal sampling probabilities, respectively. The notation π for the sampling probabilities
should be distinguished from the Roman (upright font) notation π for the mathematical constant of
circumference ratio in Example 1.
Theorem 2 (An informal and simplified version of Theorem 5). Assume that limn,p→∞ P(kM̂ −
M∗ kF ≤ eM,F ) = 1 and let M̃ be obtained by Algorithm 1. Then, under suitable assumptions
on M∗ and the asymptotic regime πmin = πmax = π, r is fixed, pπ, nπ (log(np))3 , and
{(n ∧ p)π}−1/2 . (np)−1/2 eM,F π 1/2 (log(np))−2 , with probability tending to 1, we have
kM̃ − M∗ kmax . (log(np))2 π −1/2 (np)−1/2 eM,F .
We clarify that eM,F in the above theorem is a non-random number that depends on n and p. We
consider the asymptotic regime {(n ∧ p)π}−1/2 . (np)−1/2 eM,F above because {(n ∧ p)π}−1/2 is
the minimax error rate of (np)−1/2 kM̂ − M∗ kF ; see Chen and Li (2022).
Remark 3. We provide intuitions on the result of Theorem 2 under the linear factor model setting.
Using Wedin’s sine angle theorem (Wedin, 1972) and under suitable assumptions, one can show
that there exist Θ∗ = (θij
∗) ∗ ∗ ∗ ∗ ∗ T ∗
n×r and A = (aij )p×r , such that M = Θ (A ) , k − A kF .
1. The weighted likelihood is used so that the nuisance parameters φj do not involve in estimating M, which simpli-
fies the theoretical analysis. We believe that the current analysis can be extended to the unweighted log-likelihood
function for the joint estimation of M and dispersion parameters φj .
6
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
(np)−1/2 eM,F with probability tending to 1 as n and p grow to infinity, kA∗ k2→∞ . p−1/2 , and
kΘ∗ k2→∞ . p1/2 .
Then solving for θ̃i in Step 3 can be viewed as a linear regression problem with a small mea-
surement error in the covariates, where a∗j s are the true covariates and âj are the covariates with
measurement error. Under the linear factor model, bj (mij ) = m2ij /2 for all j. Thus, one can
write down the analytic form for θ̃i that solves Equation (1). From these analytic forms, one
can show that with probability tending to 1, kΘ̃ − Θ∗ k2→∞ . log(np)π −1/2 p1/2 kA∗ − ÂkF .
log(np)π −1/2 n−1/2 eM,F , which also implies that kΘ̃k2→∞ . p1/2 . Here, the log(np) term
comes from a tail bound of maxi=1,...,n,j=1,...,p |Yij − b0 (m∗ij )|. Similarly, one can obtain the an-
alytical expression for ãj that solves Equation (2), which now involves θ̃i − θi∗ , i = 1, ..., n.
From these expressions, one can show that kà − A∗ k2→∞ . log(np)p−1 kΘ̃ − Θ∗ k2→∞ .
(log(np))2 π −1/2 n−1/2 p−1 eM,F holds with probability tending to 1. Combining the above results,
it holds that, with probability tending to 1,
Theorem 4 (An informal and simplified version of Theorem 10). Assume that limn,p→∞ P(kM̂Nk · −
M∗Nk · kF ≤ eM,F ) = 1 for eM,F (k = 1, 2) and M̃ is obtained by Algorithm 2. Then, under suitable
assumptions on M∗ and the asymptotic regime πmin = πmax = π, r is fixed, pπ, nπ (log(np))3 ,
7
C HEN AND L I
Step 5. Swap N1 and N2 in Steps 1 – 4, and obtain Θ̃N1 and Ã(2) accordingly.
Output: M̃ = (m̃ij )i∈[n],j∈[p] , where (m̃ij )i∈N1 ,j∈[p] = Θ̃N1 (Ã(2) )T and
(m̃ij )i∈N2 ,j∈[p] = Θ̃N2 (Ã(1) )T .
and {(n ∧ p)π}−1/2 . (np)−1/2 eM,F (log(np))−2 , with probability tending to 1, we have
kM̃ − M∗ kmax . (log(np))2 (np)−1/2 eM,F .
As the data splitting in Algorithm 2 is random, it may be beneficial to run it multiple times
and then aggregate the resulting estimates. We describe this variation of Algorithm 2 below. For a
fixed number of random splittings, the asymptotic behavior of Algorithm 3 is the same as that of
Algorithm 2.
Our refinement methods require input from an F-consistent estimator. We give examples of F-
consistent estimators.
8
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
CJMLE. The constrained joint maximum likelihood estimator (CJMLE) solves the following op-
timization problem
(Θ̂, Â) ∈ arg max `(ΘAT ), s.t. Θ ∈ Rn×r , A ∈ Rp×r , kΘk2→∞ ≤ C, kAk2→∞ ≤ C. (3)
Θ,A
The estimate of M is then given by M̂ = Θ̂ÂT . The terminology “joint likelihood” comes from
the latent variable model literature (Chapter 6, Skrondal and Rabe-Hesketh, 2004). This literature
distinguishes the joint likelihood from the marginal likelihood, depending on whether entries of Θ
are treated as fixed parameters or random variables, where the marginal likelihood is more com-
monly adopted in the statistical inference of traditional latent variable models. This estimator was
first proposed in Chen et al. (2019a) and Chen et al. (2020b) for the estimation of high-dimensional
GLFMs, and an error bound on kM̂ − M∗ kF under a general matrix completion setting can be
found in Theorem 2 of Chen and Li (2022). The computation of (3) can be done by an alternating
maximization algorithm as given in Chen et al. (2020b). This algorithm is theoretically guaran-
teed to converge to a critical point and has good convergence performance according to numerical
experiments (Chen et al., 2020b), though (3) is a nonconvex optimization problem.
More specifically, suppose that the true signal matrix has a decomposition M∗ = Θ∗ (A∗ )T ,
such that kΘ∗ k2→∞ ≤ C and kA∗ k2→∞ ≤ C. Then, under a similar setting as in Theorems 2 and
√
4, we have limn,p→∞ P(kM̂ − M∗ kF / np ≤ κ† {(p ∧ n)π}−1/2 ) = 1, for some finite positive
constant κ† . As shown in Proposition 1 of Chen and Li (2022), {(p ∧ n)π}−1/2 is also the minimax
lower bound for estimating M in the scaled Frobenius norm, which is why this lower bound is
assumed for (np)−1/2 eM,F in Theorems 2 and 4.
NBE. The CJMLE requires solving a non-convex optimization problem for which convergence
to the global optimum is not always guaranteed. The nuclear-norm-constrained-based estimator
(NBE) is a convex approximation to CJMLE. It solves the following optimization problem
√
M̂ ∈ arg max `(M), s.t. kMkmax ≤ ρ0 , kMk∗ ≤ ρ0 rnp. (4)
M
√
The nuclear norm constraint is introduced, since {M ∈ Rn×p : kMkmax ≤ ρ0 , kMk∗ ≤ ρ0 rnp}
is a convex relaxation of {M ∈ Rn×p : kMkmax ≤ ρ0 , rank(M) ≤ r}. This estimator has been
considered in Davenport et al. (2014) for the completion of binary matrices. When the true model
follows the M2PL model and the true signal matrix M∗ satisfies kM∗ kmax ≤ ρ0 , then Theorem 1 of
Davenport et al. (2014) implies that under the same setting of Theorems 2 and 4, limn,p→∞ P(kM̂−
√
M∗ kF / np ≤ κ‡ {(p ∧ n)π}−1/4 ) = 1, where κ‡ is a finite positive constant which depends on the
true model parameters. We believe that the same rate holds for other GLFMs under the simplified
setting of Theorems 2 and 4.
Other estimators. Note that other F-consistent estimators may be available for GLFMs, such as
SVD-based methods (Chatterjee, 2015; Zhang et al., 2020), nuclear-norm-regularized estimators
(Klopp, 2014; Koltchinskii et al., 2011; Negahban and Wainwright, 2012; Robin et al., 2020; Alaya
and Klopp, 2019) and methods based on a matrix factorization norm (Cai and Zhou, 2013, 2016).
4. Theoretical Results
4.1 Assumptions and Useful Quantities
We make the following Assumptions 2 and 3 throughout Section 4.
9
C HEN AND L I
Assumption 2. b1 (x) = · · · = bp (x) = b(x) for all x ∈ R. In addition, b(x) < ∞ and b00 (x) > 0
for all x ∈ R.
We note that this assumption is made for ease of presentation. It can be relaxed to allowing
functions bj to be variable-specific, and similar theoretical results hold following a similar proof. For
each α > 0, define functions κ2 (α) = sup|x|≤α b00 (x), κ3 (α) = sup|x|≤α |b(3) (x)|, and δ2 (α) =
inf |x|≤α b00 (x). Let M∗ have the SVD M∗ = U∗r D∗r (Vr∗ )T where r is the rank of M∗ , U∗r ∈ Rn×r
and Vr∗ ∈ Rp×r are the left and right singular matrices corresponding to the top-r singular values,
respectively, and D∗r ∈ Rr×r is a diagonal matrix whose diagonal elements are the singular values
σ1 (M∗ ) ≥ · · · ≥ σr (M∗ ) > 0. In order to apply the proposed methods, we need to input C2 .
R1: φ1 = · · · = φp = φ ∼ 1;
R7: (np)−1/2 eM,F (κ∗2 )−2 (δ2∗ )3 (log(np))−2 min [r−5/2 , (κ∗3 )−1 r−7/2 ]π 1/2 .
Then, with probability converging to 1, estimating equations in steps 3 and 4 of Algorithm 1 have a
unique solution and
h i
kM̃ − M∗ kmax . (δ2∗ )−2 (κ∗2 )2 (log(np))2 r5/2 {(n ∧ p)π}−1/2 + (npπ)−1/2 eM,F . (5)
Remark 6. We comment on the asymptotic requirements R1–R7. R1 requires the dispersion pa-
rameters to be the same for different j ∈ [p]. This assumption is made for ease of presentation,
and it can be easily relaxed to allowing varying values of dispersion parameters. It further requires
that the dispersion parameter is bounded as n and p grow large. R2 requires πmax and πmin to
10
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
be of the same asymptotic order. That is, the missing pattern is not too far from the commonly
adopted uniform missingness assumption where all the πij are the same (see, e.g. Candès and Tao,
2010; Davenport et al., 2014). R3 is a standard incoherent condition that is commonly assumed for
matrix completion to avoid spiky low-rank matrices (Candès and Recht, 2009; Jain et al., 2013).
R4 requires that the non-zero singular values of M∗ are in the same asymptotic order. In addi-
tion, we restrict the analysis to the case where η ≥ −1, because otherwise kM∗ kmax 1 and the
asymptotic regime is less interesting. We note that R4 can be relaxed to a more general asymptotic
regime allowing σr (M∗ ) and σ1 (M∗ ) to have different asymptotic order, and we provide the error
analysis under a more general setting in the appendix. R5 and R6 require the expected number
of non-missing observations for each row and column to be large enough. R7 requires the ini-
tial F-consistent estimator to have a sufficiently small estimation error in scaled Frobenius norm.
In Corollary 8 below, we give sufficient conditions for R5 – R7 under the three specific GLFMs
described in Section 2.
Remark 7. Let M̂CJMLE and M̂N BE denote the constrained joint maximum likelihood estima-
tor and nuclear-norm-constrained-based estimator described in Section 3.3, respectively. Also
let M̃CJMLE and M̃N BE be the corresponding refined estimators by applying Algorithm 1. The-
orem 5 indicates that with high probability kM̃CJMLE − M∗ kmax . (log(np))2 π −1 (n ∧ p)−1/2 and
kM̃NBE − M∗ kmax . (log(np))2 π −3/4 (n ∧ p)−1/4 when r is bounded, under suitable regularity
conditions. Because M̂CJMLE is asymptotically minimax when π ∼ 1 in Frobenius norm, we also
have that M̃CJMLE is asymptotically minimax in the matrix max norm.
In the following corollary, we provide sufficient conditions for R5 - R7 under specific GLFMs
discussed earlier.
Corollary 8. Assume that limn,p→∞ P(kM̂ − M∗ kF ≤ eM,F ) = 1 for some non-random eM,F .
Then, (5) holds under one of the following specific models and asymptotic requirements.
1. Data follow a binomial factor model and the following asymptotic requirements hold: R2 –
R4 and R5B: pπ (n ∨ p)0 r(3+4η)∨7 ; R6B: nπ (n ∨ p)0 r5 ; R7B: (np)−1/2 eM,F
(n ∧ p)−0 π 1/2 r−7/2 ; R8B: k1 = · · · = kp = k ∼ 1; and R9B: ρ . log(n ∧ p)1−0 for some
0 > 0.
2. Data follow a normal factor model and the following asymptotic requirements hold: R1 – R4;
R5N: pπ (log(np))3 r(1+2η)∨5 ; R6N: nπ (log(np))2 r3 ; and R7N: (np)−1/2 eM,F
(log(np))−2 π 1/2 r−5/2 .
3. Data follow a Poisson factor model and the following asymptotic requirements hold: R2 - R4,
R5B – R7B and R10P: r1+η . (log(n ∧ p))1−0 for some 0 > 0.
In the first part of the above corollary, R1 automatically holds because the dispersion parameter
φj = 1 in the binomial model.
Remark 9. We comment on the asymptotic requirements in the above corollary. R5B, R6B, R5N
and R6N require that rank r is relatively small comparing with (n ∧ p)π, and it can grow at most of
the order {(n∧p)π}ν1 for some constant ν1 ∈ (0, 1). Conditions R5B and R6B are slightly stronger
than R5N and R6N, because κ∗3 = 0 for the normal model while κ∗3 ∼ 1 for the binomial model.
Conditions R7B and R7N require the scaled Frobenius norm of the initial estimator to be small.
11
C HEN AND L I
Many F-consistent estimators, including CJMLE and NBE, have the error rate (np)−1/2 eM,F ∼
((n ∧ p)π)−ν2 for some ν2 ∈ (0, 1). For these estimators, R7B and R7N require that r .
((n ∧ p)π)ν3 π 1/2 for some ν3 ∈ (0, 1). Condition R8B requires the kj s to be the same for different
j ∈ [p] and are bounded. This condition can be easily relaxed to a more general setting with
varying but bounded kj s. Condition R9B requires that ρ grows much slower than n and p. Similar
assumptions are made for 1-bit matrix completion (Davenport et al., 2014; Cai and Zhou, 2013).
For Poisson factor models, R10P can be achieved either by an arbitrary r with η = −1 or by
r . (log(n ∧ p))(1−0 )/(1+η) with η > −1.
In particular, if we further assume that r ∼ 1, then, the asymptotic regime requirements R5, R6, and
R7’ can be simplified as pπ (log(np))3 , nπ (log(np))2 and (np)−1/2 eM,F (log(np))−2 ,
and we have that with probability converging to 1, kM̃ − M∗ kmax . (log(np))2 [{(n ∧ p)π}−1/2 +
(np)−1/2 eM,F ].
Remark 11. There are two main differences between Theorem 5 and Theorem 10. First, the asymp-
totic requirement R7 has an extra factor π 1/2 when compared with R7’. Second, the error rate
(5) has an extra π −1/2 factor when compared with (6). Thus, when π 1, Algorithm 1 requires
stronger regularity conditions and has a larger error rate. Additional results under a more general
asymptotic regime are provided in the appendix.
The following corollary give sufficient conditions for R7’ to hold under specific GLFMs.
(k)
Corollary 12. Assume that limn,p→∞ P(kM̂Nk · − M∗Nk · kF ≤ eM,F ) = 1 for some non-random
eM,F (k = 1, 2). Then, (6) holds under one of the following specific models and asymptotic re-
quirements.
1. Data follow a binomial factor model and the following asymptotic requirements hold: R2 -
R4, R5B, R6B, R8B, R9B, and R7’B: (np)−1/2 eM,F (n ∧ p)−0 r−7/2 for some 0 > 0.
2. Data follow a normal factor model and the following asymptotic requirements hold: R1 - R4,
R5N, R6N, and R7’N: (np)−1/2 eM,F (log(np))−2 r−5/2 .
3. Data follow a Poisson factor model and that asymptotic requirements R2 - R4, R5B,
R6B,R7’B, and R10P hold.
Remark 9 still applies to Corollary 12, except that now we have a better rate when π is close to
zero.
12
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
Setting n p r π Setting n p r π
1 400 200 3 0.6 4 400 200 3 0.2
2 800 400 3 0.6 5 800 400 3 0.2
3 1600 800 3 0.6 6 1600 800 3 0.2
Table 2: Simulation settings. All the variables are ordinal (with kj = 5), for which the Binomial
model is assumed.
5. Simulation Study
We evaluate the proposed methods via a simulation study. Eight estimation procedures are consid-
ered as listed in Table 1. For Algorithm 3, five data splittings are performed. These procedures are
applied under 24 simulation settings, where n, p, r, πmax = πmin = π, and variable types are varied.
Settings 1-6 are listed in Table 2, where all the variables follow Binomial distribution with kj = 5.
The rest of the settings and additional details on data generation can be found in the appendix. For
each simulation setting, 100 simulations are conducted.
The procedures are evaluated under two loss functions, the scaled Frobenius norm kM̂ −
√
∗
M kF / np and the max norm kM̂ − M∗ kmax . The results for Settings 1-6 are given in Fig-
ures 1 and 2, and those for the other settings show similar patterns and are given in the appendix.
First, for each procedure and given r and π, both the scaled Frobenius norm and the max norm
decay as n and p grow simultaneously. Second, comparing the two figures, we see that the error
rates are larger under Settings 4-6 than those under Settings 1-3 given the same n, p, and r, as
the proportion of missing entries is higher under Settings 4-6. Third, Procedure 1 (i.e., NBE with
no refinement) has larger error rates than its refined versions (Procedures 2-4), suggesting that the
refinement procedures reduce the error of the initial NBE. Fourth, we see that Procedures 5 and
6 perform similarly, which is expected as they are asymptotically equivalent, as discussed in Re-
mark 7. Fifth, comparing Procedures 2 and 6, we see that the refined NBE and the refined CJMLE
have very similar performance. Similar patterns are observed when comparing Procedures 3 and
7 and when comparing Procedures 4 and 8. At first glance, it may seem a little counter-intuitive.
According to Theorems 5 and 10, the error in the max norm of a refined estimator is upper bounded
by the error in the scaled Frobenius norm of its initial estimator, and thus, we would expect the
CJMLE-based refinements to have smaller errors in the max norm than the NBE-based refinements.
The pattern under the current settings may be explained by the SVD steps in Algorithms 1, 2, and
3 that project the initial estimate to the space of rank-r matrices. Under these settings, the initial
NBE after projection tends to approximate the CJMLE. We note that this is not always the case
under other settings. Under settings 23 and 24 (see their results in the appendix), the CJMLE tends
to outperform the projected NBE, and thus, the CJMLE-based refinements tend to outperform the
13
C HEN AND L I
NBE-based refinements. Finally, comparing within Procedures 2-4 and comparing within Proce-
dures 6-8, we see that Algorithm 1 leads to better empirical performance regardless of the value of
π, even though Algorithm 2 has a faster theoretical convergence speed when π approaches 0. We
conjecture that for CJMLE and NBE, the resulting  in Step 2 of Algorithm 1 does not have a high
dependence with any rows of Ω when ωij s are uniformly sampled, and thus, the upper bound in (5)
may be improved in this case. We also observe that Algorithm 3 outperforms Algorithm 2 through
aggregating results from multiple runs Algorithm 2. By running Algorithm 2 five times, Algorithm
3 has a similar performance as Algorithm 1.
0.5
0.5
0.4
0.4
0.4
Scaled Frobenius norm
0.3
0.3
0.2
0.2
0.2
0.1
0.1
0.1
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
3.0
3.0
2.5
2.5
2.5
2.0
2.0
2.0
Max norm
Max norm
Max norm
1.5
1.5
1.5
1.0
1.0
1.0
0.5
0.5
0.5
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
Figure 1: Results from Simulation Settings 1-3. The panels on the first row show the results based
on the scaled Frobenius norm, and those on the second row show the results based on the max norm.
In each panel, the box plots show the results of the eight procedures in Table 1, each constructed
from 100 independent simulations.
14
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
0.7
0.7
0.7
0.6
0.6
0.6
Scaled Frobenius norm
0.5
0.5
0.4
0.4
0.4
0.3
0.3
0.3
0.2
0.2
0.2
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
5
4
4
Max norm
Max norm
Max norm
3
3
2
2
1
1
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
Figure 2: Results from Simulation Settings 4-6. The plots can be interpreted similarly as those in
Figure 1.
Procedure Index
Rank 1 2 3 4 5 6 7 8
1 -48928 -49247 -49397 -49253 -49256 -49266 -49266 -49163
2 -53201 -49505 -49767 -48875 -48437 -48493 -48654 -48341
3 -56091 -49284 -49754 -48570 -49022 -49217 -48837 -48207
4 -56235 -49633 -50037 -48611 -51192 -51986 -49174 -48271
Table 3: Test-set log-likelihoods for the MovieLens data. The eight procedures are listed in Table 1.
M. A larger log-likelihood function value implies a higher prediction accuracy. The results are
given in Table 3. The refinement methods improve the test-set log-likelihood of the NBE when
r = 2, 3, 4 but not when r = 1, likely due to the rank-one model being too restrictive for the current
data. Turning to the results from the CJMLE and its refinements, we see that Procedures 5 and 6 tend
to perform similarly. We also see that Procedure 8, which is a refinement of CJMLE by Algorithm
3, tends to improve the test-set log-likelihood of CJMLE under all values of r. Procedure 7 also
performs fine, despite its relatively high variance brought by performing data splitting only once in
Algorithm 2. The good performance of Procedures 7 and 8 is likely due to that the distribution of the
data missingness indicators ωij is far from a uniform distribution. Instead, their distribution likely
depends on the true signal matrix (i.e., people may be more likely to have watched movies that they
like), which may lead to dependence between the initial estimate  and some rows of Ω when data
splitting is not performed. Such dependence leads to a larger estimation error. The largest test-set
log-likelihood is given by Procedure 8 (i.e., CJMLE refined by Algorithm 3) when r = 3.
15
C HEN AND L I
Procedure Index
Rank 1 2 3 4 5 6 7 8
1 -67205 -67938 -67958 -67921 -67587 -67516 -68204 -68140
2 -71620 -68556 -68733 -67749 -63250 -63313 -64914 -64842
3 -75816 -70092 -70067 -69151 -65476 -65370 -68611 -67693
4 -77632 -72365 -72238 -71640 -72320 -72648 -79466 -75989
Table 4: Test-set log-likelihoods for the PISA data. The eight procedures are listed in Table 1.
7. Discussions
This note concerns matrix completion for mixed data under a GLFM framework. It proposes entry-
wise consistent methods for estimating GLFMs based on a partially observed data matrix. Proba-
16
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
bilistic error bounds are established for the matrix max norm under sensible asymptotic regimes (see
Section 4), and they are extended under a more general asymptotic regime in the appendix. These
error bounds imply the entrywise consistency and, further, characterize the asymptotic behaviors
of the proposed methods. With these error bounds, optimal results are established under suitable
asymptotic regimes. The proposed procedures are applied to two real data examples, one on movie
recommendation and the other on large-scale educational assessment. For the movie recommenda-
tion example, the best predictive model is a rank-three model obtained by refining the CJMLE with
Algorithm 3. For the educational assessment example, a rank-two model given by the CJMLE turns
out to be the most predictive one.
The current work can be extended in several directions. First, some popular factor models,
such as the probit model for binary data considered in Davenport et al. (2014), are not exponential
family GLFMs. We believe that our refinement procedures and their theory can be extended to
many other models beyond the exponential family GLFM. This is because the theoretical properties
of these procedures mainly rely on the convexity of the loss function with respect to M, which still
holds under many other non-linear factor models. Second, the optimal rate for estimating GLFMs
is worth future investigation. We currently do not know whether our upper bounds are minimax
optimal when the dimension r diverges. Sharp lower bounds need to be developed to answer this
question.
Acknowledgments
Appendix
This appendix provides additional theoretical results, proof of the theorems, and additional sim-
ulation results.
kM̃ − M∗ kmax ≤ max(kΘ̃N1 (Ã(1) )T − M∗N1 · kmax , kΘ̃N2 (Ã(2) )T − M∗N2 · kmax ).
We will provide detailed analysis for kΘ̃N1 (Ã(1) )T − M∗N1 · kmax . The analysis of kΘ̃N2 (Ã(2) )T −
M∗N2 · kmax is similar and is thus omitted. For the ease of presentation, we drop the superscript (1) in
Â(1) when the context is clear. Recall that M∗ has the SVD M∗ = U∗r D∗r (Vr∗ )T where U∗r ∈ Rn×r ,
Vr∗ ∈ Rp×r denote the left and right singular matrices, and D∗r = diag(σ1 (M∗ ), · · · , σr (M∗ )).
The rest of the section is organized as follows. In Section A.1, we obtain an error bound for kÂ−
A∗ kF where A∗ = Vr∗ P̂ for a carefully chosen orthogonal matrix P̂. In Section A.2, we provide
non-asymptotic and non-probabilistic bounds for solutions to the non-linear estimation equations
used in Step 3 and 4 in the proposed Algorithm 2. In Section A.3, we obtain non-asymptotic
17
C HEN AND L I
probabilistic bounds for terms involved in Section A.2. In Section A.4, we put together results
in Sections A.1 – A.3 and obtain asymptotic error bounds for kΘ̃N2 − Θ∗ k2→∞ (Lemma 38),
kà − A∗ k2→∞ (Lemma 39), and kΘ̃N1 (Ã(1) )T − M∗N1 · kmax (Lemma 40) where Θ∗ = U∗r D∗r P̂.
Finally, we provide additional theoretical results for Algorithm 2 in Section A.5 and the proof of
Theorem 10 in Section A.6.
Throughout the analysis, for real number operators, we calculate multiplication and division
before the max and min operators (‘∨’ and ‘∧0 ) unless otherwise specified. For example, u(xy ∨
z/w) = u max(xy, z/w) for real numbers x, y, u, w, z. For two events A and B, we say ‘event A
has probability at least 1 − on event B’, if P(Ac ∩ B) ≤ . Note that P(A) ≥ 1 − − P(B c ) in
this case.
Proof [Proof of Lemma 13] According to Weyl’s inequality and the assumption that kM̂N1 · −
M∗N1 · k2 ≤ 2−1 ψr , σr (M̂N1 · ) ≥ σr (M∗N1 · ) − kM̂N1 · − M∗N1 · k2 ≥ 2−1 σr (M∗N1 · ) ≥ 2−1 ψr . Thus
the gaps of singular value satisfies
h i n o
min min {σi (M̂N1 · )−σj (M∗N1 · )}, min σi (M̂N1 · ) = min σr (M̂N1 · ), σr (M∗N1 · ) ≥ 2−1 ψr .
1≤i≤r,j>r 1≤i≤r
(8)
∗
Let Vr,N ∈ R p×r be the right singular value matrix corresponding to the top-r singular values of
1·
M∗N1 · and
P† = arg minkV̂r − Vr,N ∗
1·
PkF , (9)
P∈Or
where Or denotes the set of all r × r orthogonal matrices. According to the above equations and
Wedin’s sine angle theorem (Wedin, 1972),
∗
2kM̂N1 · − M∗N1 · kF 4kM̂N1 · − M∗N1 · kF
kV̂r − Vr,N 1·
P† kF = inf kV̂r − Vr,N
∗
1·
PkF ≤ ≤ .
P∈Or σr (M̂N1 · ) ψr
(10)
On the other hand, since σr (M∗N1 · ) ≥ ψr > 0, the column space of (M∗N1 · )T is the same as
∗
the columns space of Vr,N and that of Vr∗ . This implies that there exists an orthogonal matrix
1·
P̄ ∈ R r×r such that Vr,N1 · = Vr∗ P̄, which further implies that for the orthogonal matrix
∗
P̂ = P̄P† , (11)
we have kV̂r − Vr∗ P̂kF ≤ 4ψr−1 kM̂N1 · − M∗N1 · kF . According to Algorithm 2, Â is the projection
of V̂r to the set {A ∈ Rp×r : kAk2→∞ ≤ C2 } and kVr∗ P̂k2→∞ = kVr∗ k2→∞ ≤ C2 . Thus,
k − Vr∗ P̂kF ≤ k − V̂r kF +kV̂r − Vr∗ P̂kF ≤ 2kV̂r − Vr∗ P̂kF ≤ 8ψr−1 kM̂N1 · − M∗N1 · kF . (12)
18
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
The next lemma provides a non-probabilistic bound for the solution to the partial score equation
S1,i (θi , A) = 0r .
Lemma 15. Let Θ∗ ∈ Rn×r and A∗ ∈ Rp×r be such that M∗ = Θ∗ (A∗ )T and Z =
(zij ) with zij = yij − b0 (m∗ij ) and diag(Ωi· ) := diag(ωi1 , · · · , ωip ). If kΘ∗ k2→∞ ≤ C1 ,
kA∗ k2→∞ , kAk2→∞ ≤ C2 and there exists ξ > 0 such that
2σr−1 (I1,i (A)){kZi· diag(Ωi· )Ak+kB1,i (A)k+β1,i (A)κ3 (C2 (C1 + ξ))}
(15)
≤ξ ≤ 2−1 {γ1,i (A)κ3 (C2 (C1 + ξ)}−1 σK (I1,i (A)),
where we define Zi· = (zij )j∈[p] ∈ R1×p ,
p
X
B1,i (A) := ωij b00 (m∗ij )aj (aj − a∗j )T θi∗ ∈ Rr , (16)
j=1
p
X
I1,i (A) := ωij b00 (m∗ij )aj (aj )T , (17)
j=1
and
X X
β1,i (A) := sup ωij ((aj − a∗j )T θi∗ )2 |aTj u| and γ1,i (A) := sup ωij |aTj u|3 , (18)
kuk=1 j kuk=1 j
then, there is θ̃i such that kθ̃i − θi∗ k≤ ξ and S1,i (θ̃i ; A) = 0.
Proof [Proof of Lemma 15] Let θ be a vector such that kθ − θi∗ k= ξ and let mij = aTj θi . Consider
the Taylor expansion of φS1,i (θ; A),
X X
φS1,i (θ; A) = ωij (yij − b0 (m∗ij ))aj − ωij (b0 (mij ) − b0 (m∗ij ))aj
j j
X X
=AT diag(Ωi· )ZTi· − ωij b00 (m∗ij )(mij − m∗ij )aj − 2−1 ωij b(3) (m̃ij )(mij − m∗ij )2 aj ,
j j
(19)
19
C HEN AND L I
for some m̃ij between m∗ij and mij . Plugging mij − m∗ij = aTj (θ − θi∗ ) + (aj − a∗j )T θi∗ into the
above display, we obtain
X
φS1,i (θ; A) =AT diag(Ωi· )ZTi· − ωij b00 (m∗ij )aj aTj (θ − θi∗ )
j
X X . (20)
00
− ωij b (m∗ij )aj (aj − a∗j )T θi∗ − 2−1 ωij b(3) (m̃ij )(mi − m∗ij )2 aj
j j
Recall that kθ − θi∗ k= ξ. Using inequalities about matrix products and singular values, we have the
following upper bounds for the first three terms on the right-hand side of the above display.
|(θ − θi∗ )T AT diag(Ωi· )ZTi· |≤ ξkAT diag(Ωi· )ZTi· k= ξkZi· diag(Ωi· )Ak, (22)
X
− (θ − θi∗ )T ωij b00 (m∗ij )aj aTj (θ − θi∗ ) ≤ −ξ 2 σr (I1,i (A)), (23)
j
where σr (I1,i (A)) denotes the r-th largest singular value of I1,i (A), and
X
|(θ − θi∗ )T ωij b00 (m∗ij )aj (aj − a∗j )T θi∗ |= k(θ − θi∗ )T B1,i k≤ ξkB1,i k. (24)
j
Now we analyze the last term 2−1 (θ − θi∗ )T j ωij b(3) (m̃ij )(mi − m∗ij )2 aj . Note that |m̃ij |≤
P
|mij ∗ |∨|mij |≤ (C1 + ξ)C2 and mij − m∗ij = aTj (θ − θi∗ ) + (aj − a∗j )T θi∗ , we have
X
2−1 (θ − θi∗ )T b(3) (m̃ij )(mi − m∗ij )2 aj
j
X
≤2−1 κ3 ((C1 + ξ)C2 )ξ sup ωij ((aj − a∗j )T θi∗ + ξaTj u)2 |aTj u|
kuk=1 j (25)
X X
≤κ3 ((C1 + ξ)C2 ){ξ sup ωij ((aj − a∗j )T θi∗ )2 |aTj u|+ξ 3 sup ωij |aTj u|3 }
kuk=1 j kuk=1 j
Combining the analysis with (21), (22), (23), and (24), we obtain
20
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
Now, we view the right-hand side of the above inequality as a cubic function in ξ. For any cubic
function f (x) = −ax2 + bx3 + cx with a, b, c > 0, it is easy to verify that if 2c/a ≤ x ≤ a/(2b),
then f (x) ≤ 0. Applying this result, we can see that supkθ−θi∗ k=ξ (θ − θi∗ )T S1,i (θ; A) ≤ 0, if the
following inequalities hold:
−1
2σK (I1,i (A)){kZi· diag(Ωi· )Ak+kB1,i (A)k+β1,i (A)κ3 ((C1 + ξ)C2 )}
(27)
≤ξ ≤ 2−1 {γ1,i κ3 ((C1 + ξ)C2 )}−1 σr (I1,i (A)).
According to Result 6.3.4 in Ortega and Rheinboldt (2000), supkθ−θi∗ k=ξ (θ − θi∗ )T S1,i (θ; A) ≤ 0
implies that there is a solution S1,i (θ̃; A) = 0 satisfying kθ̃ − θi∗ k≤ ξ.
Next, we simplify the result of Lemma 15 to obtain a more user-friendly version in the next lemma.
Lemma 16. Let Θ∗ ∈ Rn×r and A∗ ∈ Rp×r be such that M∗ = Θ∗ (A∗ )T and Z = (zij ) with
zij = yij − b0 (m∗ij ). If kA∗ k2→∞ ≤ C2 and kAk2→∞ ≤ C2 , and
kθ̃i − θi∗ k≤ 2σr−1 (I1,i (A)){kZi· diag(Ωi· )Ak+kB1,i (A)k+β1,i (A)κ3 (3C1 C2 )}. (29)
2σr−1 (I1,i (A)){kZi· diag(Ωi· )Ak+kB1,i (A)k+β1,i (A)κ3 (C2 (C1 + ξ))}
(31)
≤2σr−1 (I1,i (A)){kZi· diag(Ωi· )Ak+kB1,i (A)k+β1,i (A)κ3 (3C1 C2 )}.
2σr−1 (I1,i (A)){kZi· diag(Ωi· )Ak+kB1,i (A)k+β1,i (A)κ3 (C2 (C1 + ξ))} ≤ ξ. (32)
On the other hand, according to the assumption that kZi· diag(Ωi· )Ak+kB1,i (A)k+β1,i (A)κ3 (3C1 C2 ) ≤
−1
2−2 γ1,i (κ3 (3C1 C2 ))−1 σr2 (I1,i (A)), we further have
21
C HEN AND L I
Equations (32) and (33) together imply (15). By Lemma 15, there is θ̃i such
that kθ̃i − θi∗ k≤ ξ and S1,i (θ̃; A) = 0. We complete the proof by noting that
ξ = 2σr−1 (I1,i (A)){kZi· diag(Ωi· )Ak+kB1,i (A)k+β1,i (A)κ3 (3C1 C2 )} ≤ 2σr−1 (I1,i ) ·
2−1 σr (I1,i (A))C1 = C1 .
By symmetry, we also have the following non-probabilistic and non-asymptotic analysis for Ã.
For each j ∈ [p], the estimating equation for aj based on ΘN2 and ΩN2 · is defined as
X
S2,j (aj ; ΘN2 ) := φ−1 ωij {yij − b0 (aTj θi )}θi . (34)
i∈N2
Let X
B2,j (ΘN2 ) = ωij b00 (m∗ij )θi (θi − θi∗ )T a∗j ∈ Rr , (35)
i∈N2
X
I2,j (ΘN2 ) = ωij b00 (m∗ij )θi (θi )T , (36)
i∈N2
and
X X
β2,j (ΘN2 ) = sup ωij ((θi − θi∗ )T a∗j )2 |θjT u| and γ2,j (ΘN2 ) = sup ωij |θiT u|3 , (37)
kuk=1 i∈N kuk=1 i∈N
2 2
Lemma 17. Let Θ∗N2 and A∗ be such that M∗N2 · = Θ∗N2 (A∗ )T and Z = (zij ) with zij = yij −
b0 (m∗ij ) and diag(ΩN2 ,j ) := diag((ωij )i∈N2 ). If kΘN2 k, kΘ∗N2 k2→∞ ≤ C1 , kA∗ k2→∞ ≤ C2 and
where ZN2 ,j = (zij )i∈N2 , then, there is ã such that S2,j (ã; ΘN2 ) = 0r , and
kãj −a∗j k≤ 2σr−1 (I2,j (ΘN2 ))){kZTN2 ,j diag(ΩN2 ,j )ΘN2 k+kB2,j (ΘN2 )k+β2,j (ΘN2 )κ3 (3C1 C2 )}.
(39)
∗
Moreover, ãj satisfies that kãj − aj k≤ C2 .
Proof [Proof of Lemma 17] The lemma follows similar proof as that of Lemma 15 and Lemma 16
with (A, A∗ , C1 , C2 ) replaced by (ΘN2 , Θ∗N2 , C2 , C1 ). We omit the details.
22
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
P (40)
where pmax = maxi∈[n] j ωij denotes the maximum number of observations in each row.
Proof [Proof of Lemma 18] We first verify that under the generalized latent factor model,
Zi· diag(Ωi· )·k is sub-exponential given ΩN2 · = (ωij )i∈N2 ,j∈[p] and Â. To see this, consider
the moment generating function
E[exp(λZi· diag(Ωi· )·k )|ΩN2 · , Â]
Y
= E[λZij âjk ωij |ΩN2 · , Â]
j∈[p]
(41)
h X i
= exp φ−1 ωij {b(m∗ij + λâjk φ) − b(m∗ij ) − λâjk φb0 (m∗ij )}
j
X
−1 2
= exp[2 λ φ ωij b00 (m̃ij )(âjk )2 ]
j
for some m̃ij between m∗ij and m∗ij + λâjk φ. Note that here we used the independence between Â
and {zij ωij }i∈N2 in the first and second equations.
Because |m∗ij |≤ ρ and |âjk |≤ C2 , for |λ|≤ (ρ + 1)/(φC2 ), m̃ij ≤ ρ + λφC2 ≤ 2ρ + 1.
Thus, E[exp(λZi· diag(Ωi· )·k )|ΩN2 · , Â] ≤ exp{λ2 φ j ωij (âjk )2 κ2 (2ρ + 1)/2} for |λ|≤ (ρ +
P
1)/(φC2 ). This implies that ZP i· diag(Ωi· )·k is sub-exponential (conditional on (ΩN2 · , Â)) with
parameters νik = φκ2 (2ρ + 1) j ωij (âjk )2 ≤ C22 φκ2 (2ρ + 1)pmax and α = φC2 /(ρ + 1).
2
Applying tail probability bound for sub-exponential random variables to Zi· diag(Ωi· )·k , we
have
2 2
P(|Zi· diag(Ωi· )·k |≥ t|ΩN2 · , Â) ≤ 2(e−t /(2νik ) ∨ e−t/(2α) ) (42)
for all positive t. This implies
P(kZi· diag(Ωi· )Âk≥ t|ΩN2 · , Â)
X √
≤ P(|Zi· diag(Ωi· )·k |≥ t/ r|ΩN2 · , Â)
(43)
k∈[r]
2 /(2r max 2 ) 1/2 α)
≤r · 2(e−t k νik
∨ e−t/(2r ).
Combining results for different i with a union bound, we have
2 2 1/2
P maxkZi· diag(Ωi· )Âk≥ t|ΩN1 · , Â ≤ 2rn · (e−t /(2r maxk νik ) ∨ e−t/(2r α) ). (44)
i∈N2
For t = {8(log(nr)r maxk∈[r] νik 2 )1/2 } ∨ 8r 1/2 α log(nr) and n ≥ 2, the right-hand side of the
23
C HEN AND L I
Lemma 19 (Upper bound for kB1,i (Â)k with data splitting). Let A∗ = Vr∗ P̂ and Θ∗ = U∗r D∗r P̂.
If  is independent with {ωij }j∈[p] for i ∈ N2 , kÂk2→∞ , kVr∗ k2→∞ ≤ C2 and kUr D∗r k2→∞ ≤ C1 ,
then, for n ≥ 4 with probability at least 1 − 1/(nr),
Proof [Proof of Lemma 19] First, by the assumptions and P̂ is orthogonal, kΘ∗ k2→∞ =
kU∗r D∗r k2→∞ ≤ C1 and kA∗ k2→∞ = kVr∗ k2→∞ ≤ C2 . Let
Then,
p
X X X
B1,i (Â) = ωij b00 (m∗ij )âj (âj − a∗j )T θi∗ = Sj + πij b00 (m∗ij )âj (âj − a∗j )T θi∗ . (48)
j=1 j∈[p] j∈[p]
Note that Sj are independent mean zero random vectors for j ∈ [p] (conditional on Â) and
This
P allow us rto apply the matrix Bernstein inequality (Equation (6.1.5) in Tropp (2015)) to
j∈[p] Sj ∈ R , and obtain
3t2 3t2
X 3t 3t
P k Sj k≥ t|Â ≤ (r + 1) · e− 8ν ∨ e− 8L ≤ 2r · e− 8ν ∨ e− 8L (50)
j∈[p]
n P o
T T
P
for t > 0 where ν = max j∈[p] E{Sj Sj |Â} , j∈[p] E{Sj Sj |Â} and L =
2 2
4κ∗2 C1 C22 ≥ kSj k for all j. Thus, for any 0 < < r
X
P k Sj k≥ {8/3 · log(2r/)}1/2 ν 1/2 ∨ {(8/3 · log(2r/))L}|Â ≤ . (51)
j∈[p]
E{Sj STj |Â} = πij (1 − πij ) · {b00 (m∗ij )}2 âj (âj − a∗j )T θi∗ (θi∗ )T (âj − a∗j )âTj , (52)
and
E{STj Sj |Â} = πij (1 − πij ) · {b00 (m∗ij )}2 (θi∗ )T (âj − a∗j )âTj âj (âj − a∗j )T θi∗ , (53)
we have
n o
max kE{STj Sj |Â}k2 , kE{Sj STj |Â}k2 ≤ πmax (κ2 (ρ))2 C12 C22 kâj − a∗j k2 (54)
24
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
which implies
n X X o
ν = max E{Sj STj |Â} , E{STj Sj |Â} ≤ πmax (κ2 (ρ))2 C12 C22 k − A∗ k2F .
2 2
j∈[p] j∈[p]
(55)
Combine the above inequality with (51), we have that with probability at least 1 − ,
X
k Sj k≤ {8/3 · log(2r/)}1/2 πmax
1/2
κ2 (ρ)C1 C2 k − A∗ kF +{(8/3 · log(2r/))} · 4κ2 (ρ)C1 C22
j∈[p]
(56)
for any 0 < < r. Simplifying this inequality, we get that with probability at least 1 − ,
X
k 1/2
Sj k≤ {16 · log(r/)} · (πmax κ2 (ρ)C1 C2 k − A∗ kF +κ2 (ρ)C1 C22 ) (57)
j∈[p]
X
k πij b00 (m∗ij )âj (âj − a∗j )T θi∗ k
j∈[p]
X
≤C1 k πij b00 (m∗ij )âj (âj − a∗j )T k2
(58)
j∈[p]
=C1 kÂT diag(πi1 b00 (m∗i1 ), · · · , πip b00 (m∗ip ))(Â − A∗ )k2
≤C1 kÂk2 πmax κ∗2 k − A∗ kF
Combine the above inequality with (48) and (57), we have
Remark 20. The first term κ∗2 πmax C1 kÂk2 k − A∗ kF in the upper bound is the leading term in
the error analysis. To obtain this error bound, we need {ωij }j∈[p] to be independent with Â. In
contrast, if {ωij }j∈[p] are dependent with Â, then the the leading term in the error analysis may be
√
larger (at the order 1/ πmax in the worst case).
Lemma 21 (Upper bound for β1,i (Â) with data splitting). If kU∗r D∗r k2→∞ ≤ C1 ,
kÂk2→∞ , kVr∗ k2→∞ ≤ C2 , and  is independent with {ωij }i∈N2 ,j∈[p] , then, with probability at
least 1 − 1/n,
25
C HEN AND L I
Conditional on Â, (ωij − πij )kâj − a∗j k2 are independent, mean-zero, bounded by 4C22 , and has the
variance πij (1−πij )kâj −a∗j k4 ≤ 4πij C22 kâj −a∗j k2 . By Bernstein’s inequality for bounded random
variables (Theorem 2.10 in Boucheron et al. (2013) with c = 4C22 /3 and v = 4πij C22 k − A∗ k2F ),
for t > 0
X
P (ωij − πij )kâj − a∗ k2 ≥ (8πij C22 k − A∗ k2F t)1/2 + 4/3 · C22 t| ≤ e−t . (62)
j∈[p]
Let t = 2 log(n) in the above inequality and note that πij ≤ πmax and 4/3 < 2, we have that with
probability at least 1 − 1/n2 ,
X
(ωij − πij )kâj − a∗ k2 ≤ 4πmax
1/2
C2 (log(n))1/2 k − A∗ kF +4C22 log(n). (63)
j∈[p]
We complete the proof by combining the above inequality with (61) and applying a union bound
for i ∈ N2 .
Remark 22. Similar to Remark 20, the above analysis also requires the independence of {ωij }j∈[p]
and  in order to obtain the leading term C12 C2 πmax k − A∗ k2F .
Lemma 23 (Upper bound for pmax ). Recall pmax = maxi∈[n] pi . If pπmax ≥ 6 log n, then
E(ωij − pij )2 =
P P P
Because j j V ar(ωij ) ≤ j πij ≤ pπmax , the above inequality implies,
n (pπmax )2 /2 o 3
P(pi − E(pi ) ≥ pπmax ) ≤ exp − = exp ( − pπmax ), (67)
(pπmax ) + (pπmax )/3 8
26
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
where the last inequality is due to the assumption that pπmax ≥ 6 log n > 16/3 log n.
Lemma 24 (Upper bound of γ1,i (Â)). If kÂk2→∞ ≤ C2 and pπmax > 6 log n, then with probability
at least 1 − 1/n,
γ1,i (Â) ≤ 2pπmax C23 . (70)
Proof [Proof of Lemma 24] The lemma follows by Lemma 23 and the following inequality
p
X
γ1,i (Â) = sup ωij |âTj u|3 ≤ pmax C23 . (71)
kuk=1 i=1
The next three lemmas together give a lower bound for σr (I1,i (Â))
Lemma 25. If kdiag(Ωi· )(Â − A∗ )k2 ≤ 2−1 σr (diag(Ωi· )A∗ ) and kM∗ kmax ≤ ρ, then
This implies σr (I1,i (Â)) ≥ δ2 (ρ)σr2 (diag(Ωi· )Â). By Weyl’s inequality, σr (diag(Ωi· )Â) ≥
σr (diag(Ωi· )A∗ )−kdiag(Ωi· )(Â−A∗ )k2 . Thus, if kdiag(Ωi· )(Â−A∗ )k2 ≤ 2−1 σr (diag(Ωi· )A∗ ),
then σr (diag(Ωi· )Â) ≥ 2−1 σr (diag(Ωi· )A∗ ), and thus,
σr (I1,i (Â)) ≥ δ2 (ρ)σr2 (diag(Ωi· )Â) ≥ 2−2 δ2 (ρ)σr2 (diag(Ωi· )A∗ ). (74)
The next two lemmas give a lower bound for σr (diag(Ωi· )A∗ ) and an upper bound for
kdiag(Ωi· )(Â − A∗ )k2 .
Lemma 26. Let A∗ = Vr∗ P̂ and let Π1,i = diag(πi1 , · · · , πip ) = E(diag(Ωi· )) and λ∗i,min =
λr ((Vr∗ )T Π1,i Vr∗ ) = λr ((A∗ )T Π1,i A∗ ), where λr (·) denotes the r-th largest eigenvalue of a sym-
metric matrix. If λ∗min := mini∈[n] λ∗i,min ≥ 16kVr∗ k22→∞ log(nr), then
P min σr2 (diag(Ωi· )A∗ ) ≤ 2−1 λ∗min ≤ 1/(nr) (75)
i∈[n]
27
C HEN AND L I
Remark 27. In the ‘moreover part’ of the above lemma, σr2 (A∗ ) = σr2 (Vr∗ P̂) = 1, so it is possible
to further simplify the statement of lemma. We keep the current form without simplification so that
similar results can be obtained by symmetry for Θ∗ = U∗r D∗r P̂, which will be useful for the analysis
later.
Proof [Proof of Lemma 26] First note that σr2 (diag(Ωi· )A∗ ) = σr2 (diag(Ωi· )Vr∗ P̂) =
σr2 (diag(Ωi· )Vr∗ ) = λr ((Vr∗ )T diag(Ωi· )Vr∗ ). Also note that for all t ∈ (0, 1)
P σr2 (diag(Ωi· )Vr∗ ) ≤ (1 − t)λ∗i,min
X X (77)
=P λr ( ωij vj∗ (vj∗ )T ) ≤ (1 − t) · λr ( πij vj∗ (vj∗ )T ) ,
j j
where vj∗ ∈ Rr denotes the j-th row of Vr∗ . Note that λr {E( j∈[p] ωij vj∗ (vj∗ )T )} = λ∗i,min ,
P
λ1 (ωij vj∗ (vj∗ )T ) ≤ kVr∗ k22→∞ , and ωij vj∗ (vj∗ )T are independent for different j. Applying Remark
5.3 in Tropp (2012) to the above probability, we obtain that for all t ∈ (0, 1),
X n o
P λr ( ωij vj∗ (vj∗ )T ) ≤ (1 − t) · λ∗i,min ≤ r exp − 2−1 kVr∗ k−22→∞ (1 − t)2 ∗
λi,min . (78)
j
Thus,
P σr2 (diag(Ωi· )A∗ ) ≤ (1 − t)λ∗i,min ≤ r exp { − 2−1 kVr∗ k−2 2 ∗
2→∞ (1 − t) λi,min }. (79)
Apply a union bound to the above inequality for different i ∈ [n], we obtain
P( min σr2 (diag(Ωi· )A∗ ) ≤ 2−1 λ∗min ) ≤ nr exp { − 8−1 kVr∗ k−2 ∗
2→∞ λmin }. (82)
i∈[n]
The right-hand side of the above inequality is no greater than (nr)−1 when λ∗min ≥
16kVr∗ k22→∞ log(nr) = 16kA∗ k22→∞ log(nr).
The ‘moreover’ part of the lemma is proved by noting that λ∗i,min = λr ( j∈[p] πij a∗j (a∗j )T ) ≥
P
Lemma 28. If kÂk2→∞ , kVr∗ k2→∞ ≤ C2 and  is independent with {ωij }i∈N2 ,j∈[p] , then with
probability at least 1 − 1/(nr),
for n ≥ 4.
28
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
X n 3t2 3t o
P k (ωij − πij )∆aj ∆Taj k2 ≥ t|Â ≤ 2r · exp − ∧ (84)
8ν 8L
j∈[p]
where ν = 4πmax C22 k∆A k2F ≥ j∈[p] kE[{ωij ∆aj ∆Taj }T ωij ∆aj ∆Taj ]k and L = 4C22 ≥ k(ωij −
P
Now we give an upper bound for t = [{8/3 · log(2r/)}1/2 ν 1/2 ] ∨ [{8/3 · log(2r/)}L] for ∈
(0, r/10)
for ∈ (0, r/10). Applying a union bound to the above result with = 1/(rn2 ), we have
X
k (ωij − πij )∆aj ∆Taj k2 ≤ 64 log(n) · {(πmax
1/2
C2 k∆A kF ) ∨ C22 } (88)
j∈[p]
Xp Xp
T
λ1 ( πij ∆aj ∆aj ) ≤ πmax λ1 ( ∆aj ∆Taj ) = πmax k∆A k22 ≤ πmax k∆A k2F . (89)
j=1 j=1
Combining the above two inequalities and note that kdiag(Ωi· )(Â − A∗ )k22 =
λ1 ( j∈[p] ωij ∆aj ∆Taj ), we obtain that with probability at least 1 − 1/(nr),
P
for n ≥ 4.
29
C HEN AND L I
Lemma 31 (Upper bound for kZTN2 ,j diag(ΩN2 ,j )Θ̃N2 k). Assume that nπmax ≥ 6 log(p). With
probability at least 1 − 3/p − P(kΘ̃N2 k2→∞ > 2C1 ),
maxkZTN2 ,j diag(ΩN2 ,j )Θ̃N2 k
j∈[p]
≤16{φ1/2 (κ∗2 )1/2 C1 log1/2 (pr)r1/2 (nπmax )1/2 ∨ r1/2 φC1 /(ρ + 1) log(pr)} (92)
30
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
Combine the above display with Lemma 29 and Lemma 30, we have that with probability at least
1 − 3/p,
≤16{φ1/2 (κ∗2 )1/2 C1 log1/2 (pr)r1/2 (nπmax )1/2 ∨ r1/2 φC1 /(ρ + 1) log(pr)} (96)
Lemma 32 (Upper bound for kB2,j (Θ̃N2 )k). Assume that nπmax ≥ 6 log(p). With probability at
least 1 − 1/p,
maxkB2,j (Θ̃N2 )k≤ 4C1 C2 κ∗2 nπmax kΘ̃N2 − Θ∗N2 k2→∞ , (97)
j∈[p]
maxkB2,j (Θ̃N2 )k≤ 4C1 C2 κ∗2 nπmax kΘ̃N2 − Θ∗N2 k2→∞ (99)
j∈[p]
Lemma 33. Assume that nπmax ≥ 6 log(p). With probability at least 1 − 1/p,
Lemma 34. Assume that nπmax ≥ 6 log(p). With probability at least 1 − 1/p,
31
C HEN AND L I
Lemma 35. Assume that P(kΘ̃N2 − Θ∗N2 k2→∞ ≤ eΘ,2→∞ ) ≥ 1 − for some non-random
eΘ,2→∞ , nπmax ≥ 6 log(p), πmin σr2 (Θ∗N2 ) ≥ 32kΘ∗N2 k22→∞ log(p), p ≥ r, and 2e2Θ,2→∞ nπmax ≤
2−3 πmin σr2 (Θ∗N2 ). Then, with probability at least 1 − 2/p −
I2,j (Θ̃N2 ) ≥ 2−2 δ2 (ρ)πmin σr2 (Θ∗ ) ≥ 2−2 δ2 (ρ)πmin ψr2 (104)
Proof [Proof of Lemma 35] First note that
X
kdiag(ΩN2 ,j )(Θ̃N2 − Θ∗N2 )k22 = k ωij (θ̃i − θi∗ )(θ̃i − θi∗ )T k2 ≤ kΘ̃N2 − Θ∗N2 k22→∞ ·nmax
i∈N2
(105)
Combine the above inequality with Lemma 29, we have that with probability at least 1 − 1/p,
kdiag(ΩN2 ,j )(Θ̃N2 − Θ∗N2 )k22 ≤ 2kΘ̃N2 − Θ∗N2 k22→∞ ·nπmax . (106)
On the other hand, with similar argument as those in the proof of Lemma 26, we have that if
πmin σr2 (Θ∗N2 ) ≥ 32kΘ∗N2 k22→∞ log(p) and p ≥ r, then
P min σr2 (diag(ΩN2 ,j )Θ∗N2 ) ≤ 2−1 πmin σr2 (Θ∗N2 ) ≤ 1/(pr) (107)
i∈[n]
Thus, if 2e2Θ,2→∞ nπmax ≤ 2−3 πmin σr2 (Θ∗N2 ), then with probability at least 1 − − 2/p,
where the last inequality in the above display holds because Θ∗N2 = (U∗r )N2 · D∗r and as a result
σr (Θ∗ ) = σr (M∗N2 · ) ≥ ψr .
32
E NTRYWISE C ONSISTENCY FOR M IXED - DATA M ATRIX C OMPLETION
Proof First, as RG is a submatrix of R, we have σ1 (RG ) ≤ σ1 (R). In the rest of the proof, we
show that (109) holds. Let T = P UD ∈ Rn×r . Then, RG = TG VT and σr2 (RG ) = λr (RG RTG ) =
λr (TG TTG ) = λr (TTG TG ) = λr ( i∈[n] gi ti tTi ) where ti = TTi· indicates the i-th row of the matrix
T.
Note
P that for each i, gi ti tTi is positive semi-definite, and λ1 (gi ti tTi ) ≤ kti k2 ≤ kTk22→∞ . Also,
λr (E( i∈[n] gi ti ti )) = 2−1 λr (TT T) = 2−1 σr2 (R). Applying the weak Chernoff bounds for
T
matrices (inequalities on page 61 of Tropp (2015) under equations (5.1.7) with t = 1/2), we obtain
−3 σ 2 (R)/kTk2
X
P(λr ( gi ti tTi ) ≤ 2−2 σr2 (R)) ≤ re−2 r 2→∞ . (110)
i∈[n]
Lemma 37 (Asymptotic bounds for ψ_1 and ψ_r). Recall that ψ_1 = σ_1(M^*_{N_1·}) ∨ σ_1(M^*_{N_2·}) and ψ_r = σ_r(M^*_{N_1·}) ∧ σ_r(M^*_{N_2·}). If σ_r^2(M^*)/σ_1^2(M^*) ≫ ‖U_r^*‖_{2→∞}^2 log(r), then, with probability converging to 1, σ_r(M^*) ≲ ψ_r ≤ ψ_1 ≤ σ_1(M^*).

Proof [Proof of Lemma 37] This lemma is a direct application of Lemma 36 with R, U, and G replaced by M^*, U_r^*, and N_1 (or N_2). We omit the details.
Lemma 38 (Asymptotic analysis for Θ̃_{N_2}). Let A^* = V_r^* P̂ and Θ^* = U_r^* D_r^* P̂, where P̂ is defined in (11). Assume that lim_{n,p→∞} P(‖Â − A^*‖_F ≤ e_{A,F}) = 1. Assume the following asymptotic regime holds:
1. φ ≲ 1;
4. pπ_min ≫ (δ_2^*)^{-4}(κ_2^*)^2 (log(n))^2 · max{r^{1∨(1+2η_1)∨(1−2η_2)} (π_max/π_min), (κ_3^*)^2 (π_max/π_min)^3 r^{5∨(3+2η_1)∨(3+4η_1)}};
5. e_{A,F} ≪ (κ_2^*)^{-1}(δ_2^*)^2 min{r^{−(η_1−η_2)} (π_min/π_max), (κ_3^*)^{-1} r^{−2−η_1} (π_min/π_max)^2}.
Then, with probability converging to 1, there is Θ̃_{N_2} = (θ̃_i^T)_{i∈N_2} ∈ R^{|N_2|×r} such that S_{1,i}(θ̃_i; Â) = 0 for all i ∈ N_2, and

‖Θ̃_{N_2} − Θ^*_{N_2}‖_{2→∞} ≲ κ_2^* (δ_2^*)^{-1} (π_max/π_min) p^{1/2} {r (log(n))^{1/2} (pπ_max)^{−1/2} + r^{1/2+η_1} e_{A,F}}.   (111)

Moreover, Θ̃_{N_2} defined above satisfies ‖Θ̃_{N_2} − Θ^*_{N_2}‖_{2→∞} ≤ C_1, and θ̃_i is the unique solution to the optimization problem max_{θ_i∈R^r} Σ_{j∈[p]} ω_{ij} {y_{ij} θ_i^T â_j − b(θ_i^T â_j)} for all i ∈ N_2.
Proof [Proof of Lemma 38] First, we analyze the asymptotic regime assumption. The 4-th condition, i.e.,

pπ_min ≫ (δ_2^*)^{-4}(κ_2^*)^2 (log(n))^2 max{r^{1∨(1+2η_1)∨(1−2η_2)} (π_max/π_min), (κ_3^*)^2 (π_max/π_min)^3 r^{5∨(3+2η_1)∨(3+4η_1)}},   (112)

implies (113). Similarly, the 5-th condition, i.e.,

e_{A,F} ≪ (κ_2^*)^{-1}(δ_2^*)^2 min{r^{−(η_1−η_2)} (π_min/π_max), (κ_3^*)^{-1} r^{−2−η_1} (π_min/π_max)^2},   (114)

implies

e_{A,F} ≪ min{r^{−1−η_1}(κ_3^*)^{-1}κ_2^*, (π_min/π_max)^{1/2}, (κ_2^*)^{-1} δ_2^* r^{−(η_1−η_2)} (π_min/π_max), (κ_3^*)^{-1}(κ_2^*)^{-1}(δ_2^*)^2 r^{−2−η_1} (π_min/π_max)^2}.   (115)

Also, we have (116).
Throughout the proof, we restrict the analysis to the event {‖Â − A^*‖_F ≤ e_{A,F}} ∩ {p_max ≤ 2pπ_max} ∩ {(np)^{1/2} r^{η_2} ≲ ψ_r ≤ ψ_1 ≤ (np)^{1/2} r^{η_1}}, which has probability converging to 1 by the lemma's assumption, (113), (116), and Lemma 24. On this event, we have that, with probability at least 1 − 1/n,

max_{i∈N_2} ‖Z_{i·} diag(Ω_{i·}) Â‖ ≤ 32{φ^{1/2}(κ_2^*)^{1/2} C_2 log^{1/2}(n) r^{1/2}(pπ_max)^{1/2} ∨ r^{1/2} φ C_2/(ρ+1) log(n)},   (118)

according to Lemma 18. Under the asymptotic regime that φ ≲ 1 and C_2 ≲ (r/p)^{1/2}, the above inequality implies

max_{i∈N_2} ‖Z_{i·} diag(Ω_{i·}) Â‖ ≲ (κ_2^*)^{1/2} r log^{1/2}(n) π_max^{1/2} + r p^{−1/2} log(n).   (119)

Note that κ_2^* ≳ 1. According to (113), pπ_min ≫ r(log n)^2, which implies r p^{−1/2} log(n) ≪ (κ_2^*)^{1/2} r log^{1/2}(n) π_max^{1/2}. Thus, the above display implies

max_{i∈N_2} ‖Z_{i·} diag(Ω_{i·}) Â‖ ≲ (κ_2^*)^{1/2} r log^{1/2}(n) π_max^{1/2}   (120)

with probability converging to 1. Next, according to Lemma 19, with probability converging to 1, we have

max_{i∈N_2} ‖B_{1,i}(Â)‖ ≤ κ_2^* π_max C_1 ‖Â‖_2 ‖Â − A^*‖_F + 64 log(n) · (π_max^{1/2} κ_2^* C_1 C_2 ‖Â − A^*‖_F + κ_2^* C_1 C_2^2 log(n)).   (121)

According to (117), C_1 C_2^2 ≲ r^{3/2+η_1} p^{−1/2}. Also, note that ‖Â‖_2 ≤ 1. Thus, the above display implies that, with probability converging to 1,

max_{i∈N_2} ‖B_{1,i}(Â)‖ ≲ κ_2^* {π_max r^{1/2+η_1} p^{1/2} e_{A,F} + r^{1+η_1} π_max^{1/2} log(n) e_{A,F} + r^{3/2+η_1} p^{−1/2} log(n)}.   (122)

According to (113), pπ_min ≫ r(log n)^2, which implies π_max^{1/2} r^{1+η_1} log(n) ≪ π_max r^{1/2+η_1} p^{1/2}. Thus, (122) implies that, with probability converging to 1,

max_{i∈N_2} ‖B_{1,i}(Â)‖ ≲ κ_2^* (π_max r^{1/2+η_1} p^{1/2} e_{A,F} + r^{3/2+η_1} p^{−1/2} log(n)).   (123)

According to (113), pπ_min ≫ r^{1+2η_1} log(n), which implies r^{3/2+η_1} p^{−1/2} log(n) ≲ r log^{1/2}(n) π_max^{1/2}. This, together with (120) and (123), gives

max_{i∈N_2} {‖Z_{i·} diag(Ω_{i·}) Â‖ + ‖B_{1,i}(Â)‖} ≲ κ_2^* {r log^{1/2}(n) π_max^{1/2} + π_max r^{1/2+η_1} p^{1/2} e_{A,F}}.   (124)
Note that C_1^2 C_2 ≲ r^{3/2+2η_1} p^{1/2}. Thus, the above display implies

max_{i∈N_2} β_{1,i}(Â) κ_3^* ≲ κ_3^* r^{3/2+2η_1} p^{1/2} {π_max e_{A,F}^2 + π_max^{1/2} r^{1/2} p^{−1/2} (log(n))^{1/2} e_{A,F} + r p^{−1} log(n)}.   (126)

First, according to (115), e_{A,F} ≲ r^{−1−η_1}(κ_3^*)^{-1} κ_2^*, which implies κ_3^* r^{3/2+2η_1} p^{1/2} π_max e_{A,F}^2 ≲ κ_2^* π_max r^{1/2+η_1} p^{1/2} e_{A,F}. Second, according to (113), pπ_min ≫ (κ_3^*)^2 (κ_2^*)^{-2} r^{3+2η_1} log(n), which implies κ_3^* r^{3/2+2η_1} p^{1/2} · π_max^{1/2} r^{1/2} p^{−1/2} (log(n))^{1/2} e_{A,F} ≲ κ_2^* π_max r^{1/2+η_1} p^{1/2} e_{A,F}. Third, according to (113), pπ_min ≫ (κ_3^*)^2 (κ_2^*)^{-2} r^{3+4η_1} log(n), which implies κ_3^* r^{3/2+2η_1} p^{1/2} · r p^{−1} log(n) ≪ κ_2^* r log^{1/2}(n) π_max^{1/2}. Thus, (126) implies that, with probability converging to one,

max_{i∈N_2} β_{1,i}(Â) κ_3^* ≲ κ_2^* {r log^{1/2}(n) π_max^{1/2} + π_max r^{1/2+η_1} p^{1/2} e_{A,F}}.   (127)

Equations (124) and (127) together imply that, with probability converging to 1,

max_{i∈N_2} {‖Z_{i·} diag(Ω_{i·}) Â‖ + ‖B_{1,i}(Â)‖ + β_{1,i}(Â) κ_3^*} ≲ κ_2^* {r log^{1/2}(n) π_max^{1/2} + π_max r^{1/2+η_1} p^{1/2} e_{A,F}}.   (128)
Next, we find a lower bound for σ_r(I_{1,i}(Â)). Note that σ_r(A^*) = 1 and ‖A^*‖_{2→∞}^2 ≲ r/p by assumption. Under the asymptotic regime pπ_min ≫ r(log(n))^2, we have π_min σ_r^2(A^*) ≥ 32 ‖A^*‖_{2→∞}^2 log(n) for n large enough. According to Lemma 26, with probability at least 1 − 1/(nr),

min_{i∈N_2} σ_r^2(diag(Ω_{i·}) A^*) ≥ 2^{-1} π_min   (129)

for n and p large enough. According to Lemma 28, with probability converging to 1, the bound (130) holds. First, according to (115), e_{A,F} ≪ (π_min/π_max)^{1/2}, which implies π_max e_{A,F}^2 ≪ π_min. Second, according to (113) and (115), e_{A,F} ≪ (π_min/π_max)^{1/2} and π_min p ≫ r(log(n))^2, which imply e_{A,F} ≪ (π_min/π_max)^{1/2} (π_min p)^{1/2} r^{−1/2} (log(n))^{−1}. This further implies π_max^{1/2} (r/p)^{1/2} log(n) e_{A,F} ≪ π_min. Third, according to (113), pπ_min ≫ r(log(n))^2, which implies (r/p) log(n) ≪ π_min. Combining the analysis, we have that, with probability converging to one,

max_{i∈N_2} ‖diag(Ω_{i·})(Â − A^*)‖_2^2 ≪ π_min.   (131)

Combining the above display with (129) and using Lemma 25, we have that, with probability converging to 1,

min_{i∈N_2} σ_r(I_{1,i}(Â)) ≥ 2^{-3} δ_2^* π_min.   (132)
So far, we have obtained upper bounds for max_{i∈N_2} {‖Z_{i·} diag(Ω_{i·}) Â‖ + ‖B_{1,i}(Â)‖ + β_{1,i}(Â) κ_3^*} and a lower bound for σ_r(I_{1,i}(Â)). In the rest of the proof, we restrict our analysis to the event that (128) and (132) hold. To proceed, we verify the conditions of Lemma 16. According to Lemma 24, on the event p_max ≤ 2pπ_max, max_{i∈N_2} γ_{1,i}(Â) ≲ pπ_max (r/p)^{3/2}. This and (132) imply that, with probability tending to 1,

min_{i∈N_2} {(γ_{1,i}(Â))^{-1} (κ_3(3C_1C_2))^{-1} σ_r^2(I_{1,i}(Â))} ≳ (κ_3^*)^{-1} (δ_2^*)^2 p^{1/2} r^{−3/2} π_min^2/π_max.   (133)
According to (113), pπ_min ≫ (κ_2^*)^2 (κ_3^*)^2 (δ_2^*)^{-4} (π_max/π_min)^3 r^5 log(n), which implies κ_2^* r log^{1/2}(n) π_max^{1/2} ≪ (κ_3^*)^{-1} (δ_2^*)^2 p^{1/2} r^{−3/2} π_min^2/π_max. According to (115), e_{A,F} ≪ (κ_3^*)^{-1} (κ_2^*)^{-1} (δ_2^*)^2 r^{−2−η_1} (π_min/π_max)^2, which implies κ_2^* π_max r^{1/2+η_1} p^{1/2} e_{A,F} ≪ (κ_3^*)^{-1} (δ_2^*)^2 p^{1/2} r^{−3/2} π_min^2/π_max. Combining the analysis, we have κ_2^* r log^{1/2}(n) π_max^{1/2} + κ_2^* π_max r^{1/2+η_1} p^{1/2} e_{A,F} ≪ (κ_3^*)^{-1} (δ_2^*)^2 p^{1/2} r^{−3/2} π_min^2/π_max. This, together with (133), implies

max_{i∈N_2} {‖Z_{i·} diag(Ω_{i·}) Â‖ + ‖B_{1,i}(Â)‖ + β_{1,i}(Â) κ_3^*} ≪ min_{i∈N_2} {(γ_{1,i}(Â))^{-1} (κ_3(3C_1C_2))^{-1} σ_r^2(I_{1,i}(Â))}.   (134)
Next, according to (132) and C_1 = {‖U_r^*‖_{2→∞} ∨ (r/n)^{1/2}} · σ_1(M^*),

min_{i∈N_2} {σ_r(I_{1,i}(Â)) C_1} ≳ δ_2^* π_min (r/n)^{1/2} (np)^{1/2} r^{η_2} ≳ δ_2^* π_min r^{1/2+η_2} p^{1/2}.   (135)

According to (113), pπ_min ≫ (π_max/π_min) (κ_2^*)^2 (δ_2^*)^{-2} r^{1−2η_2} log(n), which implies κ_2^* r log^{1/2}(n) π_max^{1/2} ≪ δ_2^* π_min r^{1/2+η_2} p^{1/2}. According to (115), e_{A,F} ≪ (κ_2^*)^{-1} δ_2^* (π_min/π_max) r^{−(η_1−η_2)}, which implies κ_2^* π_max r^{1/2+η_1} p^{1/2} e_{A,F} ≪ δ_2^* π_min r^{1/2+η_2} p^{1/2}. Combining the analysis and (133), we get

max_{i∈N_2} {‖Z_{i·} diag(Ω_{i·}) Â‖ + ‖B_{1,i}(Â)‖ + β_{1,i}(Â) κ_3^*} ≪ min_{i∈N_2} {σ_r(I_{1,i}(Â)) C_1}.   (136)
According to (134) and (136), the conditions of Lemma 16 are satisfied. According to Lemma 16, together with (128) and (132), with probability converging to 1, there exists Θ̃_{N_2} = (θ̃_i^T)_{i∈N_2} ∈ R^{|N_2|×r} such that S_{1,i}(θ̃_i; Â) = 0 for all i ∈ N_2, the bound (111) holds, and ‖Θ̃_{N_2} − Θ^*_{N_2}‖_{2→∞} ≤ C_1. Moreover, θ̃_i described above is the unique solution to the optimization problem max_{θ_i∈R^r} Σ_{j∈[p]} ω_{ij} {y_{ij} θ_i^T â_j − b(θ_i^T â_j)} for all i ∈ N_2, because this optimization problem has a strictly concave objective by (132).
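To make the estimating-equation step concrete, the following sketch (not part of the paper's code) solves S_{1,i}(θ_i; Â) = 0 for a single row by Newton's method. It assumes the score is the gradient of the objective displayed above, S_{1,i}(θ; A) = Σ_{j∈[p]} ω_{ij}{y_{ij} − b'(θ^T a_j)} a_j, and it specializes to the logistic case b(x) = log(1 + e^x) purely for illustration; the function name refine_row and its defaults are hypothetical.

import numpy as np

def b_prime(x):
    # Mean function b'(x) for the illustrative logistic case b(x) = log(1 + e^x).
    return 1.0 / (1.0 + np.exp(-x))

def b_second(x):
    # Variance function b''(x) = e^x / (1 + e^x)^2 for the logistic case.
    p = b_prime(x)
    return p * (1.0 - p)

def refine_row(y_i, A_hat, omega_i, n_iter=50, tol=1e-10):
    """Solve sum_j omega_ij * (y_ij - b'(theta^T a_j)) * a_j = 0 for one row i
    by Newton's method; this is the gradient of the concave objective
    sum_j omega_ij * {y_ij * theta^T a_j - b(theta^T a_j)}.
    Assumes enough observed entries that the Hessian is nonsingular."""
    obs = omega_i.astype(bool)
    A_obs, y_obs = A_hat[obs], y_i[obs]
    theta = np.zeros(A_hat.shape[1])
    for _ in range(n_iter):
        eta = A_obs @ theta
        score = A_obs.T @ (y_obs - b_prime(eta))              # gradient of the objective
        hessian = A_obs.T @ (b_second(eta)[:, None] * A_obs)  # minus the Hessian (PSD)
        step = np.linalg.solve(hessian, score)                # Newton direction
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta

Because the objective is strictly concave on the event considered here, the root found this way is the unique maximizer θ̃_i, and Newton's method converges rapidly in its neighborhood.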
Lemma 39 (Asymptotic analysis for Ã). Assume that lim_{n,p→∞} P(‖Θ̃_{N_2} − Θ^*_{N_2}‖_{2→∞} ≤ e_{Θ,2→∞}) = 1. Assume the following asymptotic regime holds:
1. φ ≲ 1;
4. nπ_min ≫ (κ_2^*)^2 (δ_2^*)^{-4} (log(np))^2 · max{(π_max/π_min) r^{(1+2η_1−2η_2)∨(1+2η_1−4η_2)}, (κ_3^*)^2 (π_max/π_min)^3 r^{5+8η_1−8η_2}};   (138)
5. e_{Θ,2→∞} ≤ C_1 and
e_{Θ,2→∞} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} · min{(π_min/π_max) r^{(−1/2−η_1+2η_2)∧(1/2+2η_2)}, (κ_3^*)^{-1} (π_min/π_max)^2 r^{(−5/2−4η_1+4η_2)∧(−3/2−3η_1+4η_2)}}.   (139)
Then, with probability converging to 1, there is Ã = (ã_j^T)_{j∈[p]} ∈ R^{p×r} such that S_{2,j}(ã_j; Θ̃_{N_2}) = 0 for all j ∈ [p], ‖Ã − A^*‖_{2→∞} ≤ C_2, and

‖Ã − A^*‖_{2→∞} ≲ κ_2^* (δ_2^*)^{-1} (π_max/π_min) r^{−2η_2} log(np) p^{−1/2} {r^{1+η_1} (nπ_max)^{−1/2} + r^{(1+η_1)∨0} p^{−1/2} e_{Θ,2→∞}}.   (140)

Moreover, ã_j defined above is the unique solution to the optimization problem max_{a_j∈R^r} Σ_{i∈N_2} ω_{ij} {y_{ij} θ̃_i^T a_j − b(θ̃_i^T a_j)} for all j ∈ [p].
Proof [Proof of Lemma 39] First, the 4-th condition on the asymptotic regime, i.e.,

nπ_min ≫ (κ_2^*)^2 (δ_2^*)^{-4} (log(np))^2 max{(π_max/π_min) r^{(1+2η_1−2η_2)∨(1+2η_1−4η_2)}, (κ_3^*)^2 (π_max/π_min)^3 r^{5+8η_1−8η_2}},   (141)

implies (142) and n ≫ r^{1+2(η_1−η_2)} log(r); the latter ensures that the conditions of Lemma 37 hold, and thus (np)^{1/2} r^{η_2} ≲ ψ_r ≤ ψ_1 ≲ (np)^{1/2} r^{η_1} with probability converging to 1.
Similarly, the 5-th condition, i.e.,

e_{Θ,2→∞} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} · min{(π_min/π_max) r^{(−1/2−η_1+2η_2)∧(1/2+2η_2)}, (κ_3^*)^{-1} (π_min/π_max)^2 r^{(−5/2−4η_1+4η_2)∧(−3/2−3η_1+4η_2)}},   (143)

implies

e_{Θ,2→∞} ≪ min{ p^{1/2} r^{1/2+η_2} (≲ C_1), κ_2^* (κ_3^*)^{-1} r^{−1/2} p^{1/2} log(np), (π_min/π_max)^{1/2} p^{1/2} r^{η_2}, (κ_2^*)^{-1} (κ_3^*)^{-1} (δ_2^*)^2 (π_min/π_max)^2 p^{1/2} r^{(−5/2−4η_1+4η_2)∧(−3/2−3η_1+4η_2)} (log(np))^{-1}, (κ_2^*)^{-1} δ_2^* (π_min/π_max) r^{(−1/2−η_1+2η_2)∧(1/2+2η_2)} (log(np))^{-1} p^{1/2} },   (144)

where we used η_2 > −1/2 − η_1 + 2η_2, which holds because η_1 − η_2 ≥ 0.
Throughout the proof, we restrict the analysis to the event ‖Θ̃_{N_2} − Θ^*_{N_2}‖_{2→∞} ≤ e_{Θ,2→∞} ≤ C_1, which has probability converging to 1 as n, p → ∞, according to the assumption of the lemma and (144). This also implies that ‖Θ̃_{N_2}‖_{2→∞} ≤ 2C_1 with probability converging to 1. According to Lemma 31 and under the asymptotic regime nπ_max ≫ log(p), with probability converging to 1,

max_{j∈[p]} ‖Z_{N_2,j}^T diag(Ω_{N_2,j}) Θ̃_{N_2}‖
≤ 16{φ^{1/2}(κ_2^*)^{1/2} C_1 log^{1/2}(pr) r^{1/2}(nπ_max)^{1/2} ∨ r^{1/2} φ C_1/(ρ+1) log(pr)} + 16 ‖Θ̃_{N_2} − Θ^*_{N_2}‖_{2→∞} · nπ_max log(np) {(κ_2^* φ)^{1/2} ∨ 1}   (145)
≲ (κ_2^*)^{1/2} r^{1/2+η_1} p^{1/2} log^{1/2}(p) r^{1/2} (nπ_max)^{1/2} + r^{1/2} r^{1/2+η_1} p^{1/2} log(p) + e_{Θ,2→∞} nπ_max log(np) (κ_2^*)^{1/2}
≲ (κ_2^*)^{1/2} r^{1+η_1} p^{1/2} n^{1/2} π_max^{1/2} log^{1/2}(p) + e_{Θ,2→∞} nπ_max log(n ∨ p) (κ_2^*)^{1/2},

where we used r^{1/2} p^{1/2} r^{1/2+η_1} log(p) ≲ p^{1/2} r^{1+η_1} log^{1/2}(p) (nπ_max)^{1/2} under the asymptotic regime nπ_max ≫ log(p) for the last inequality.

According to Lemma 32, with probability converging to 1,

max_{j∈[p]} ‖B_{2,j}(Θ̃_{N_2})‖ ≤ 4 C_1 C_2 κ_2^* nπ_max ‖Θ̃_{N_2} − Θ^*_{N_2}‖_{2→∞} ≲ κ_2^* r^{1+η_1} nπ_max e_{Θ,2→∞},   (146)

max_{j∈[p]} β_{2,j}(Θ̃_{N_2}) ≤ 4 C_1 C_2^2 ‖Θ̃_{N_2} − Θ^*_{N_2}‖_{2→∞}^2 nπ_max ≲ r^{3/2+η_1} p^{−1/2} e_{Θ,2→∞}^2 nπ_max.   (147)
Under the asymptotic regime that e_{Θ,2→∞} ≲ κ_2^* (κ_3^*)^{-1} r^{−1/2} p^{1/2} log(np), we have r^{3/2+η_1} p^{−1/2} e_{Θ,2→∞}^2 nπ_max κ_3^* ≲ κ_2^* r^{1+η_1} log(np) nπ_max e_{Θ,2→∞}. Thus, the above inequality implies

max_{j∈[p]} {‖Z_{N_2,j}^T diag(Ω_{N_2,j}) Θ̃_{N_2}‖ + ‖B_{2,j}(Θ̃_{N_2})‖ + β_{2,j}(Θ̃_{N_2}) κ_3^*} ≲ κ_2^* {r^{1+η_1} p^{1/2} n^{1/2} π_max^{1/2} log^{1/2}(np) + r^{(1+η_1)∨0} log(np) nπ_max e_{Θ,2→∞}}.   (149)

Next, we derive a lower bound for σ_r(I_{2,j}(Θ̃_{N_2})). Under the asymptotic regime nπ_min ≫ r^{1+2η_1−2η_2} log(p) and e_{Θ,2→∞} ≪ (π_min/π_max)^{1/2} p^{1/2} r^{η_2}, we have nπ_max ≫ log(p), π_min (np) r^{2η_2} ≫ r^{1+2η_1} p log(p), and e_{Θ,2→∞}^2 nπ_max ≪ π_min (np) r^{2η_2}. Note that σ_r^2(Θ^*_{N_2}) ≥ σ_r^2(M^*_{N_2·}) ≥ ψ_r^2 ≳ (np) r^{2η_2} and ‖Θ^*_{N_2}‖_{2→∞} ≲ (r/n)^{1/2} ψ_1 ≲ r^{1/2+η_1} p^{1/2}. Thus, under the same asymptotic regime, the conditions of Lemma 35 hold. Therefore, with probability converging to 1,

σ_r(I_{2,j}(Θ̃_{N_2})) ≥ 2^{-2} δ_2^* π_min ψ_r^2 ≳ δ_2^* π_min (np) r^{2η_2}.   (150)

Note that, by an argument parallel to (133),

min_{j∈[p]} {(γ_{2,j}(Θ̃_{N_2}))^{-1} (κ_3(3C_1C_2))^{-1} σ_r^2(I_{2,j}(Θ̃_{N_2}))} ≳ (κ_3^*)^{-1} (δ_2^*)^2 (π_min^2/π_max) p^{1/2} n r^{−3/2−3η_1+4η_2}.   (151)
Under the asymptotic regime nπ_min ≫ (κ_2^*)^2 (κ_3^*)^2 (δ_2^*)^{-4} (π_max/π_min)^3 r^{5+8η_1−8η_2} (log(np))^2, we have κ_2^* r^{1+η_1} log(np) p^{1/2} n^{1/2} π_max^{1/2} ≪ (κ_3^*)^{-1} (δ_2^*)^2 (π_min^2/π_max) p^{1/2} n r^{−3/2−3η_1+4η_2}. Under the asymptotic regime e_{Θ,2→∞} ≪ (κ_2^*)^{-1} (κ_3^*)^{-1} (δ_2^*)^2 (π_min/π_max)^2 p^{1/2} r^{(−5/2−4η_1+4η_2)∧(−3/2−3η_1+4η_2)} (log(np))^{-1}, we have κ_2^* r^{(1+η_1)∨0} log(np) · nπ_max e_{Θ,2→∞} ≪ (κ_3^*)^{-1} (δ_2^*)^2 (π_min^2/π_max) p^{1/2} n · r^{−3/2−3η_1+4η_2}. Combining the analysis, we have κ_2^* r^{1+η_1} p^{1/2} n^{1/2} π_max^{1/2} log^{1/2}(np) + κ_2^* r^{(1+η_1)∨0} log(np) nπ_max e_{Θ,2→∞} ≪ (κ_3^*)^{-1} (δ_2^*)^2 (π_min^2/π_max) p^{1/2} n r^{−3/2−3η_1+4η_2}. This further implies

‖Z_{N_2,j}^T diag(Ω_{N_2,j}) Θ̃_{N_2}‖ + ‖B_{2,j}(Θ̃_{N_2})‖ + β_{2,j}(Θ̃_{N_2}) κ_3^* ≪ 2^{-2} (γ_{2,j}(Θ̃_{N_2}))^{-1} (κ_3^*)^{-1} σ_r^2(I_{2,j}(Θ̃_{N_2}))   (152)

for all j. According to (150), σ_r(I_{2,j}(Θ̃_{N_2})) C_2 ≳ δ_2^* π_min (np) r^{2η_2} (r/p)^{1/2} ≳ δ_2^* π_min n p^{1/2} r^{1/2+2η_2}. According to (142), nπ_min ≫ (κ_2^*)^2 (δ_2^*)^{-2} (π_max/π_min) r^{1+2η_1−4η_2} log^2(np), which implies κ_2^* r^{1+η_1} log(np) p^{1/2} n^{1/2} π_max^{1/2} ≪ δ_2^* π_min n p^{1/2} r^{1/2+2η_2}. According to (144), e_{Θ,2→∞} ≪ (κ_2^*)^{-1} δ_2^* (π_min/π_max) r^{(−1/2−η_1+2η_2)∧(1/2+2η_2)} (log(np))^{-1} p^{1/2}, which implies κ_2^* r^{(1+η_1)∨0} log(np) · nπ_max e_{Θ,2→∞} ≪ δ_2^* π_min n p^{1/2} r^{1/2+2η_2}. Combining the analysis, we obtain

‖Z_{N_2,j}^T diag(Ω_{N_2,j}) Θ̃_{N_2}‖ + ‖B_{2,j}(Θ̃_{N_2})‖ + β_{2,j}(Θ̃_{N_2}) κ_3^* ≪ σ_r(I_{2,j}(Θ̃_{N_2})) C_2   (153)

for all j.
The inequalities (152) and (153) verify the conditions of Lemma 17 (with C_1 replaced by 2C_1). According to Lemma 17, and combining (149) and (150), with probability converging to 1,

‖Ã − A^*‖_{2→∞} ≤ max_{j∈[p]} σ_r^{-1}(I_{2,j}(Θ̃_{N_2})) {‖Z_{N_2,j}^T diag(Ω_{N_2,j}) Θ̃_{N_2}‖ + ‖B_{2,j}(Θ̃_{N_2})‖ + β_{2,j}(Θ̃_{N_2}) κ_3^*}
≲ κ_2^* (δ_2^*)^{-1} π_min^{-1} (np)^{-1} r^{−2η_2} {r^{1+η_1} p^{1/2} n^{1/2} π_max^{1/2} log(np) + r^{(1+η_1)∨0} log(np) nπ_max e_{Θ,2→∞}}
≲ κ_2^* (δ_2^*)^{-1} (π_max/π_min) r^{−2η_2} log(np) p^{−1/2} {r^{1+η_1} (nπ_max)^{−1/2} + r^{(1+η_1)∨0} p^{−1/2} e_{Θ,2→∞}}.   (154)
Lemma 40 (Asymptotic analysis for M̃_{N_2·} = Θ̃_{N_2} Ã^T). Assume that lim_{n,p→∞} P(‖M̂_{N_1·} − M^*_{N_1·}‖_F ≤ e_{M,F}) = 1, and the following asymptotic regime holds:
1. φ ≲ 1;
3. (np)^{1/2} r^{η_2} ≲ σ_r(M^*) ≤ σ_1(M^*) ≲ (np)^{1/2} r^{η_1} for some constants η_1 and η_2;
4. pπ_min ≫ (κ_2^*)^4 (δ_2^*)^{-6} (log(np))^3 · max[(π_max/π_min)^3 r^{(1+2η_1)∨(3+2η_1−4η_2)∨(1−4η_2)}, (κ_3^*)^2 (π_max/π_min)^5 r^{(3+2η_1)∨(3+4η_1)∨{7+8(η_1−η_2)}∨(5+6η_1−8η_2)}];   (155)
5. nπ_min ≫ (κ_2^*)^2 (δ_2^*)^{-4} (log(np))^2 max{(π_max/π_min) r^{(1+2η_1−2η_2)∨(1+2η_1−4η_2)}, (κ_3^*)^2 (π_max/π_min)^3 r^{5+8η_1−8η_2}};   (156)
6. (np)^{−1/2} e_{M,F} ≪ (κ_2^*)^{-2} (δ_2^*)^3 (log(np))^{-1} (π_min/π_max)^3 min[r^{(−η_1+η_2)∧(−1−2η_1+3η_2)∧(−η_1+3η_2)}, (κ_3^*)^{-1} r^{(−2−η_1)∨{−3−5(η_1−η_2)}∧(−2−4η_1+5η_2)}].   (157)
Proof [Proof of Lemma 40] First, we analyze the asymptotic regime assumption. The 4-th condition of the asymptotic regime, i.e.,

pπ_min ≫ (κ_2^*)^4 (δ_2^*)^{-6} (log(np))^3 · max[(π_max/π_min)^3 r^{(1+2η_1)∨(3+2η_1−4η_2)∨(1−4η_2)}, (κ_3^*)^2 (π_max/π_min)^5 r^{(3+2η_1)∨(3+4η_1)∨{7+8(η_1−η_2)}∨(5+6η_1−8η_2)}],   (159)

implies

pπ_min ≫ max{ (δ_2^*)^{-4} (κ_2^*)^2 (log(n))^2 max{r^{1∨(1+2η_1)∨(1−2η_2)} (π_max/π_min), (κ_3^*)^2 (π_max/π_min)^3 r^{5∨(3+2η_1)∨(3+4η_1)}}, (κ_2^*)^4 (δ_2^*)^{-6} (π_max/π_min)^3 (log(np))^3 r^{(3+2η_1−4η_2)∨(1−4η_2)}, (κ_3^*)^2 (κ_2^*)^4 (δ_2^*)^{-6} (π_max/π_min)^5 r^{{7+8(η_1−η_2)}∨(5+6η_1−8η_2)} (log(np))^3 },   (160)

where we used the facts that 1 ≤ (1+2η_1)∨(1−2η_2), 3 + 2η_1 − 4η_2 > 2 − 2η_2, and 7 + 8(η_1 − η_2) > 5.
The 6-th condition of the asymptotic regime, i.e.,

(np)^{−1/2} e_{M,F} ≪ (κ_2^*)^{-2} (δ_2^*)^3 (log(np))^{-1} (π_min/π_max)^3 min[r^{(−η_1+η_2)∧(−1−2η_1+3η_2)∧(−η_1+3η_2)}, (κ_3^*)^{-1} r^{(−2−η_1)∨{−3−5(η_1−η_2)}∧(−2−4η_1+5η_2)}],   (161)

implies

(np)^{−1/2} e_{M,F} ≪ min{ r^{η_2}, r^{η_2} (κ_2^*)^{-1} (δ_2^*)^2 min{r^{−(η_1−η_2)} (π_min/π_max), (κ_3^*)^{-1} r^{−2−η_1} (π_min/π_max)^2}, (κ_2^*)^{-2} (δ_2^*)^3 (π_min/π_max)^2 (log(np))^{-1} r^{(−1−2η_1+3η_2)∧(−η_1+3η_2)}, (κ_2^*)^{-2} (δ_2^*)^3 (π_min/π_max)^3 (log(np))^{-1} (κ_3^*)^{-1} r^{{−3−5(η_1−η_2)}∧(−2−4η_1+5η_2)} },   (162)

where we used the facts that η_2 ≥ −1 − 2η_1 + 3η_2 and η_2 − (η_1 − η_2) ≥ −1 − 2η_1 + 3η_2.
According to (162), e_{M,F} ≪ (np)^{1/2} r^{η_2} ≲ ψ_r, which implies that the conditions for Lemma 14 hold. Thus, with probability converging to 1, ‖Â − A^*‖_F ≤ e_{A,F}, where e_{A,F} = 8 ψ_r^{−1} e_{M,F}. Note that e_{A,F} ≲ r^{−η_2} (np)^{−1/2} e_{M,F}. According to (162), e_{M,F} ≪ (np)^{1/2} r^{η_2} (κ_2^*)^{-1} (δ_2^*)^2 min{r^{−(η_1−η_2)} (π_min/π_max), (κ_3^*)^{-1} r^{−2−η_1} (π_min/π_max)^2}, which implies e_{A,F} ≪ (κ_2^*)^{-1} (δ_2^*)^2 min{r^{−(η_1−η_2)} (π_min/π_max), (κ_3^*)^{-1} r^{−2−η_1} (π_min/π_max)^2}. According to (160), pπ_min ≫ (δ_2^*)^{-4} (κ_2^*)^2 (log(n))^2 max{r^{1∨(1+2η_1)∨(1−2η_2)} (π_max/π_min), (κ_3^*)^2 (π_max/π_min)^3 r^{5∨(3+2η_1)∨(3+4η_1)}}. Thus, the asymptotic regime of Lemma 38 is satisfied.

According to Lemma 38, ‖Θ̃_{N_2} − Θ^*_{N_2}‖_{2→∞} ≤ e_{Θ_{N_2},2→∞} with probability converging to 1, for e_{Θ_{N_2},2→∞} satisfying

e_{Θ_{N_2},2→∞} ∼ κ_2^* (δ_2^*)^{-1} (π_max/π_min) p^{1/2} {r (log(n))^{1/2} (pπ_max)^{−1/2} + r^{1/2+η_1} e_{A,F}}   (163)
≲ κ_2^* (δ_2^*)^{-1} (π_max/π_min) p^{1/2} {r (log(n))^{1/2} (pπ_max)^{−1/2} + r^{1/2+η_1} · r^{−η_2} (np)^{−1/2} e_{M,F}}.
Next, we verify that the asymptotic regime of Lemma 39 is satisfied. We first verify the conditions about e_{Θ,2→∞}. According to (160), pπ_min ≫ (κ_2^*)^4 (δ_2^*)^{-6} (π_max/π_min)^3 (log(np))^3 r^{(3+2η_1−4η_2)∨(1−4η_2)}, which implies

κ_2^* (δ_2^*)^{-1} (π_max/π_min) p^{1/2} · r (log(n))^{1/2} (pπ_max)^{−1/2} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} (π_min/π_max) r^{(−1/2−η_1+2η_2)∧(1/2+2η_2)}.   (164)

According to (160), pπ_min ≫ (κ_3^*)^2 (κ_2^*)^4 (δ_2^*)^{-6} (π_max/π_min)^5 r^{{7+8(η_1−η_2)}∨(5+6η_1−8η_2)} (log(np))^3, which implies

κ_2^* (δ_2^*)^{-1} (π_max/π_min) p^{1/2} · r (log(n))^{1/2} (pπ_max)^{−1/2} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} (κ_3^*)^{-1} (π_min/π_max)^2 r^{(−5/2−4η_1+4η_2)∧(−3/2−3η_1+4η_2)}.   (165)

According to (162), (np)^{−1/2} e_{M,F} ≪ (κ_2^*)^{-2} (δ_2^*)^3 (π_min/π_max)^2 (log(np))^{-1} r^{(−1−2η_1+3η_2)∧(−η_1+3η_2)}, which implies

κ_2^* (δ_2^*)^{-1} (π_max/π_min) p^{1/2} · r^{1/2+η_1} · r^{−η_2} (np)^{−1/2} e_{M,F} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} (π_min/π_max) r^{(−1/2−η_1+2η_2)∧(1/2+2η_2)}.   (166)

According to (162), (np)^{−1/2} e_{M,F} ≪ (κ_2^*)^{-2} (δ_2^*)^3 (π_min/π_max)^3 (log(np))^{-1} (κ_3^*)^{-1} r^{{−3−5(η_1−η_2)}∧(−2−4η_1+5η_2)}, which implies

κ_2^* (δ_2^*)^{-1} (π_max/π_min) p^{1/2} · r^{1/2+η_1} · r^{−η_2} (np)^{−1/2} e_{M,F} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} (κ_3^*)^{-1} (π_min/π_max)^2 r^{(−5/2−4η_1+4η_2)∧(−3/2−3η_1+4η_2)}.   (167)
Combining (164)–(167), we obtain

κ_2^* (δ_2^*)^{-1} (π_max/π_min) p^{1/2} {r (log(n))^{1/2} (pπ_max)^{−1/2} + r^{1/2+η_1−η_2} (np)^{−1/2} e_{M,F}}
≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} · min{(π_min/π_max) r^{(−1/2−η_1+2η_2)∧(1/2+2η_2)}, (κ_3^*)^{-1} (π_min/π_max)^2 r^{(−5/2−4η_1+4η_2)∧(−3/2−3η_1+4η_2)}},   (168)

which implies that e_{Θ_{N_2},2→∞} satisfies the 5-th condition of the asymptotic regime of Lemma 39. On the other hand, according to the lemma's assumption,

nπ_min ≫ (κ_2^*)^2 (δ_2^*)^{-4} (log(np))^2 max{(π_max/π_min) r^{(1+2η_1−2η_2)∨(1+2η_1−4η_2)}, (κ_3^*)^2 (π_max/π_min)^3 r^{5+8η_1−8η_2}}.   (169)

Thus, the other requirements for the asymptotic regime in Lemma 39 are also satisfied.
According to Lemma 39, we have ‖Ã − A^*‖_{2→∞} ≤ e_{A,2→∞} with probability converging to 1, where

e_{A,2→∞} ∼ κ_2^* (δ_2^*)^{-1} (π_max/π_min) r^{−2η_2} log(np) {r^{1+η_1} p^{−1/2} (nπ_max)^{−1/2} + r^{(1+η_1)∨0} p^{−1/2} e_{Θ,2→∞}}.   (170)

Plugging (163) into (170), we obtain

e_{A,2→∞}
≲ κ_2^* (δ_2^*)^{-1} (π_max/π_min) r^{−2η_2} log(np) p^{−1/2} [ r^{1+η_1} (nπ_max)^{−1/2} + r^{(1+η_1)∨0} p^{−1/2} · κ_2^* (δ_2^*)^{-1} (π_max/π_min) p^{1/2} {r (log(n))^{1/2} (pπ_max)^{−1/2} + r^{1/2+η_1} · r^{−η_2} (np)^{−1/2} e_{M,F}} ]
≲ (δ_2^*)^{-2} (κ_2^*)^2 (log(np))^{3/2} (π_max/π_min)^2 p^{−1/2} [ r^{(2+η_1−2η_2)∨(1−2η_2)} {(p∧n) π_max}^{−1/2} + r^{(3/2+2η_1−3η_2)∨(1/2+η_1−3η_2)} (np)^{−1/2} e_{M,F} ].   (171)
Now, we combine the above analysis to find an upper bound for ‖M̃_{N_2·} − M^*_{N_2·}‖_max. Recall that M̃_{N_2·} = Θ̃_{N_2} Ã^T. Thus, for P̂ ∈ O^{r×r} defined in (11), and Θ^*_{N_2} = (U_r^*)_{N_2·} D_r^* P̂, A^* = V_r^* P̂, we have

M̃_{N_2·} − M^*_{N_2·} = Θ̃_{N_2} Ã^T − (U_r^*)_{N_2·} D_r^* (V_r^*)^T = Θ̃_{N_2} Ã^T − (U_r^*)_{N_2·} D_r^* P̂ (V_r^* P̂)^T = Θ̃_{N_2} Ã^T − Θ^*_{N_2} (A^*)^T = (Θ̃_{N_2} − Θ^*_{N_2})(A^*)^T + Θ̃_{N_2} (Ã − A^*)^T.   (172)
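Because the max norm of a product is controlled row-by-row, via the elementary inequality |(XY^T)_{ij}| ≤ ‖X_{i·}‖ ‖Y_{j·}‖, the decomposition (172) yields the following bound, the counterpart of (238) in the no-data-splitting analysis:
\[
\|\tilde{M}_{N_2\cdot} - M^{*}_{N_2\cdot}\|_{\max} \le \|\tilde{\Theta}_{N_2} - \Theta^{*}_{N_2}\|_{2\to\infty}\,\|A^{*}\|_{2\to\infty} + \|\tilde{\Theta}_{N_2}\|_{2\to\infty}\,\|\tilde{A} - A^{*}\|_{2\to\infty}.
\]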
5. nπ_min ≫ (κ_2^*)^2 (δ_2^*)^{-4} (log(np))^2 max{(π_max/π_min) r^{(1+2η_1−2η_2)∨(1+2η_1−4η_2)}, (κ_3^*)^2 (π_max/π_min)^3 r^{5+8η_1−8η_2}};   (176)
6. (np)^{−1/2} e_{M,F} ≪ (κ_2^*)^{-2} (δ_2^*)^3 (log(np))^{-1} (π_min/π_max)^3 min[r^{(−η_1+η_2)∧(−1−2η_1+3η_2)∧(−η_1+3η_2)}, (κ_3^*)^{-1} r^{(−2−η_1)∨{−3−5(η_1−η_2)}∧(−2−4η_1+5η_2)}].   (177)
Then, with probability converging to 1, the estimating equations in steps 3 and 4 of Algorithm 2 have a unique solution, and

‖M̃ − M^*‖_max ≲ (δ_2^*)^{-2} (κ_2^*)^2 (π_max/π_min)^2 log^{3/2}(np) [ r^{(5/2+2η_1−2η_2)∨(3/2+η_1−2η_2)} {(p∧n) π_max}^{−1/2} + r^{(2+3η_1−3η_2)∨(1+2η_1−3η_2)} (np)^{−1/2} e_{M,F} ].   (178)
Proof [Proof of Lemma 41] Recall that M̃ = (m̃_{ij})_{i∈[n],j∈[p]}, where (m̃_{ij})_{i∈N_1,j∈[p]} = Θ̃_{N_1}^{(2)} (Ã^{(2)})^T and (m̃_{ij})_{i∈N_2,j∈[p]} = Θ̃_{N_2}^{(1)} (Ã^{(1)})^T. The error rate for (m̃_{ij})_{i∈N_2,j∈[p]} = Θ̃_{N_2}^{(1)} (Ã^{(1)})^T is obtained by Lemma 40, and the error rate for (m̃_{ij})_{i∈N_1,j∈[p]} is obtained by swapping (Â^{(1)}, Θ̃_{N_2}^{(1)}, Ã^{(1)}, N_1) with (Â^{(2)}, Θ̃_{N_1}^{(2)}, Ã^{(2)}, N_2) in the proof of Lemma 40.

The uniqueness of the solution to the estimating equations in steps 3 and 4 of Algorithm 2 follows from the uniqueness properties in Lemmas 38 and 39.
(np)^{−1/2} e_{M,F} ≪ (κ_2^*)^{-2} (δ_2^*)^3 (log(np))^{-1} min[r^{0∧(−1+η)∧(2η)}, (κ_3^*)^{-1} r^{(−2−η)∧(−3)∧(−2+η)}],   (181)

and is implied by (np)^{−1/2} e_{M,F} ≪ (κ_2^*)^{-2} (δ_2^*)^3 (log(np))^{-1} min[r^{−2}, (κ_3^*)^{-1} r^{−3}], which is in turn implied by the asymptotic requirement R7'.

Thus, under R1–R6 and R7', the conditions of Lemma 41 are satisfied, and, with probability converging to 1,

‖M̃ − M^*‖_max
≲ (δ_2^*)^{-2} (κ_2^*)^2 log^{3/2}(np) [ r^{(5/2+2η_1−2η_2)∨(3/2+η_1−2η_2)} {(p∧n)π}^{−1/2} + r^{(2+3η_1−3η_2)∨(1+2η_1−3η_2)} (np)^{−1/2} e_{M,F} ]
≲ (δ_2^*)^{-2} (κ_2^*)^2 log^{3/2}(np) [ r^{5/2∨(3/2−η)} {(p∧n)π}^{−1/2} + r^{2∨(1−η)} (np)^{−1/2} e_{M,F} ]
≲ (δ_2^*)^{-2} (κ_2^*)^2 log^{3/2}(np) [ r^{5/2} {(p∧n)π}^{−1/2} + r^2 (np)^{−1/2} e_{M,F} ]
≲ (δ_2^*)^{-2} (κ_2^*)^2 log^2(np) r^{5/2} [ {(p∧n)π}^{−1/2} + (np)^{−1/2} e_{M,F} ].   (182)

Here, A^* = V_r^* P̂ and Θ^* = U_r^* D_r^* P̂, for P̂ ∈ O^{r×r} defined in (183). With similar derivations to those for Lemma 14, we have the following lemma.
The rest of the section is organized as follows. In Section B.1, we obtain non-asymptotic probabilistic bounds for the terms involved in the estimating equations in Steps 3 and 4 of Algorithm 1. In Section B.2, we obtain asymptotic error bounds for ‖Θ̃ − Θ^*‖_{2→∞} (Lemma 48). In Section B.3, we provide an error bound for ‖M̃ − M^*‖_max (Lemma 49) under a general setting. Finally, the proof of Theorem 5 is given in Section B.4.
For the first term on the right-hand side of the above inequality, we follow a proof similar to that of Lemma 18 (with Â replaced by A^*) and obtain that, with probability at least 1 − (nr)^{−1},

max_{i∈[n]} ‖Z_{i·} diag(Ω_{i·}) A^*‖ ≤ 8{φ^{1/2} (κ_2(2ρ+1))^{1/2} C_2 log^{1/2}(nr) r^{1/2} p_max^{1/2} ∨ r^{1/2} φ C_2/(ρ+1) log(nr)}.   (189)

For the second term on the right-hand side of equation (188), we apply Lemma 30 and obtain that, with probability at least 1 − (np)^{−1},

‖Z‖_max p_max^{1/2} ‖Â − A^*‖_F ≤ 8 log(np) {(φ κ_2^*)^{1/2} ∨ 1} · p_max^{1/2} ‖Â − A^*‖_F.   (190)

The proof is completed by combining the above two inequalities.
Lemma 44 (Upper bound for ‖B_{1,i}(Â)‖ without data splitting). Let A^* = V_r^* P̂ and Θ^* = U_r^* D_r^* P̂. Assume ‖Â‖_{2→∞}, ‖V_r^*‖_{2→∞} ≤ C_2 and ‖U_r^* D_r^*‖_{2→∞} ≤ C_1; here Â may be dependent on Ω_{i·}. Then,

‖B_{1,i}(Â)‖ ≤ C_1 C_2 κ_2^* p_max^{1/2} ‖Â − A^*‖_F.   (191)

Proof [Proof of Lemma 44] First, by the assumptions and the orthogonality of P̂, ‖Θ^*‖_{2→∞} = ‖U_r^* D_r^*‖_{2→∞} ≤ C_1 and ‖A^*‖_{2→∞} = ‖V_r^*‖_{2→∞} ≤ C_2. Recall that

‖B_{1,i}(Â)‖ = ‖Σ_{j=1}^p ω_{ij} b''(m^*_{ij}) â_j (â_j − a_j^*)^T θ_i^*‖ ≤ C_1 C_2 Σ_{j=1}^p ω_{ij} b''(m^*_{ij}) ‖â_j − a_j^*‖ ≤ C_1 C_2 κ_2^* Σ_{j=1}^p ω_{ij} ‖â_j − a_j^*‖.   (192)
Lemma 45 (Bound for β_{1,i}(Â), without data splitting). If ‖U_r^* D_r^*‖_{2→∞} ≤ C_1 and ‖Â‖_{2→∞}, ‖V_r^*‖_{2→∞} ≤ C_2, then,

Lemma 46 (Bound for γ_{1,i}(Â), without data splitting). If ‖Â‖_{2→∞} ≤ C_2, then, with probability at least 1 − 1/n,

max_{i∈[n]} γ_{1,i}(Â) ≤ 2 p π_max C_2^3.   (196)

Proof [Proof of Lemma 46] The proof of this lemma is the same as that of Lemma 24, which does not require the independence between Â and Ω_{i·}.
Lemma 47.

max_{i∈[n]} ‖diag(Ω_{i·})(Â − A^*)‖_2^2 ≤ ‖Â − A^*‖_F^2.   (197)

Proof

max_{i∈[n]} ‖diag(Ω_{i·})(Â − A^*)‖_2^2 ≤ max_{i∈[n]} ‖diag(Ω_{i·})‖_2^2 ‖Â − A^*‖_F^2 = ‖Â − A^*‖_F^2.   (198)
1. φ ∼ 1, π_min ∼ π_max ∼ π;
3. (np)^{1/2} r^{η_2} ≲ σ_r(M^*) ≤ σ_1(M^*) ≲ (np)^{1/2} r^{η_1}, where η_1 and η_2 are constants;
5. e_{A,F} ≪ (κ_2^*)^{-1} (δ_2^*)^2 (log(np))^{-1} min{r^{0∧(−1/2−η_1+η_2)∧(1/2+η_2)}, (κ_3^*)^{-1} r^{(−5/2−η_1)∧(−3/2)}} π^{1/2}.
Then, with probability converging to 1, there is Θ̃ = (θ̃_i^T)_{i∈[n]} such that S_{1,i}(θ̃_i; Â) = 0 for all i ∈ [n], ‖Θ̃ − Θ^*‖_{2→∞} ≤ C_1, and

‖Θ̃ − Θ^*‖_{2→∞} ≲ κ_2^* (δ_2^*)^{-1} π^{−1/2} {r (log(n))^{1/2} + log(np) r^{(1+η_1)∨0} p^{1/2} e_{A,F}}.   (199)

Moreover, θ̃_i is the unique solution to the optimization problem max_{θ_i∈R^r} Σ_{j∈[p]} ω_{ij} {y_{ij} θ_i^T â_j − b(θ_i^T â_j)} for all i ∈ [n].
Proof [Proof of Lemma 48] First, we provide an analysis of the asymptotic regime. Note that κ_2^* ≥ κ_2(0) ≳ 1 and δ_2^* ≤ δ_2(0) ≲ 1. Then, the 4-th condition on the asymptotic regime, i.e.,

e_{A,F} ≪ (κ_2^*)^{-1} (δ_2^*)^2 (log(np))^{-1} min{r^{0∧(−1/2−η_1+η_2)∧(1/2+η_2)}, (κ_3^*)^{-1} r^{(−5/2−η_1)∧(−3/2)}} π^{1/2},   (202)

implies

e_{A,F} ≪ min{ (κ_3^*)^{-1} r^{−1/2−η_1} π^{1/2}, π^{1/2}, (κ_2^*)^{-1} (κ_3^*)^{-1} (δ_2^*)^2 (log(np))^{-1} r^{(−5/2−η_1)∧(−3/2)} π^{1/2}, δ_2^* (κ_2^*)^{-1} (log(np))^{-1} r^{(−1/2−η_1+η_2)∧(1/2+η_2)} π^{1/2} }.   (203)
With probability converging to 1,

max_{i∈[n]} ‖Z_{i·} diag(Ω_{i·}) Â‖ ≤ 16{φ^{1/2} (κ_2(2ρ+1))^{1/2} C_2 log^{1/2}(nr) r^{1/2} (pπ_max)^{1/2} ∨ r^{1/2} φ C_2/(ρ+1) log(nr)} + 8 log(np) {(φ κ_2^*)^{1/2} ∨ 1} p_max^{1/2} e_{A,F},   (204)

according to Lemma 43. Under the asymptotic regime that φ ≲ 1, π_min ∼ π_max ∼ π, and C_2 ≲ (r/p)^{1/2}, the above inequality implies

max_{i∈[n]} ‖Z_{i·} diag(Ω_{i·}) Â‖ ≲ (κ_2^*)^{1/2} r log^{1/2}(n) π^{1/2} + r p^{−1/2} log(n) + (κ_2^*)^{1/2} log(np) p^{1/2} π^{1/2} e_{A,F}.   (205)

According to (201), pπ ≫ r(log n)^2, which implies r p^{−1/2} log(n) ≪ (κ_2^*)^{1/2} r log^{1/2}(n) π^{1/2}. Thus, the above display implies

max_{i∈[n]} ‖Z_{i·} diag(Ω_{i·}) Â‖ ≲ (κ_2^*)^{1/2} r log^{1/2}(n) π^{1/2} + (κ_2^*)^{1/2} log(np) p^{1/2} π^{1/2} e_{A,F}.   (206)
Note that C_1 C_2 ≲ r^{1+η_1}. Thus, the above display implies that, with probability converging to one,

max_{i∈[n]} {‖Z_{i·} diag(Ω_{i·}) Â‖ + ‖B_{1,i}(Â)‖} ≲ κ_2^* {r log^{1/2}(n) π^{1/2} + log(np) r^{(1+η_1)∨0} p^{1/2} π^{1/2} e_{A,F}}.   (209)
Note that C_1^2 C_2 ≲ r^{3/2+2η_1} p^{1/2}. Thus, the above display implies κ_3^* r^{3/2+2η_1} p^{1/2} e_{A,F}^2 ≲ κ_2^* log(np) r^{(1+η_1)∨0} p^{1/2} π^{1/2} e_{A,F}.

Next, we find a lower bound for σ_r(I_{1,i}(Â)). With similar derivations to those for (129), we have (211). According to (203), e_{A,F} ≪ π^{1/2}. Thus, the above two inequalities and Lemma 25 together imply that, with probability converging to 1,

min_{i∈[n]} σ_r(I_{1,i}(Â)) ≥ 2^{-3} δ_2^* π.   (212)

Next, we verify the conditions of Lemma 16. According to Lemma 46, on the event p_max ≤ 2pπ_max, max_{i∈[n]} γ_{1,i}(Â) ≲ pπ (r/p)^{3/2}. Following similar arguments to those for (133), we have, with probability tending to 1,

min_{i∈[n]} {(γ_{1,i}(Â))^{-1} (κ_3(3C_1C_2))^{-1} σ_r^2(I_{1,i}(Â))} ≳ (κ_3^*)^{-1} (δ_2^*)^2 p^{1/2} r^{−3/2} π.   (216)

Under the asymptotic regime pπ ≫ (κ_2^*)^2 (κ_3^*)^2 (δ_2^*)^{-4} r^5 log(n), we have κ_2^* π^{1/2} r (log(n))^{1/2} ≪ (κ_3^*)^{-1} (δ_2^*)^2 p^{1/2} r^{−3/2} π. Under the asymptotic regime e_{A,F} ≪ (κ_2^*)^{-1} (κ_3^*)^{-1} (δ_2^*)^2 (log(np))^{-1} r^{(−5/2−η_1)∧(−3/2)} π^{1/2}, we have κ_2^* log(np) r^{(1+η_1)∨0} p^{1/2} π^{1/2} e_{A,F} ≪ (κ_3^*)^{-1} (δ_2^*)^2 p^{1/2} r^{−3/2} π. Combining the analysis, we have κ_2^* {log^{1/2}(n) r π^{1/2} + log(np) r^{(1+η_1)∨0} p^{1/2} π^{1/2} e_{A,F}} ≪ (κ_3^*)^{-1} (δ_2^*)^2 p^{1/2} r^{−3/2} π. This, together with (216), implies, with probability tending to 1,

max_{i∈[n]} {‖Z_{i·} diag(Ω_{i·}) Â‖ + ‖B_{1,i}(Â)‖ + β_{1,i}(Â) κ_3^*} ≪ min_{i∈[n]} {(γ_{1,i}(Â))^{-1} (κ_3(3C_1C_2))^{-1} σ_r^2(I_{1,i}(Â))}.   (217)
According to (201), pπ ≫ (κ_2^*)^2 (δ_2^*)^{-2} log(n) r^{1−2η_2}, which implies κ_2^* log^{1/2}(n) r π^{1/2} ≪ δ_2^* π r^{1/2+η_2} p^{1/2}. According to (203), e_{A,F} ≪ δ_2^* (κ_2^*)^{-1} (log(np))^{-1} r^{(−1/2−η_1+η_2)∧(1/2+η_2)} π^{1/2}, which implies κ_2^* log(np) r^{(1+η_1)∨0} p^{1/2} π^{1/2} e_{A,F} ≪ δ_2^* π r^{1/2+η_2} p^{1/2}. Combining the analysis with (201) and (212), we obtain, with probability tending to 1,

max_{i∈[n]} {‖Z_{i·} diag(Ω_{i·}) Â‖ + ‖B_{1,i}(Â)‖ + β_{1,i}(Â) κ_3^*} ≪ min_{i∈[n]} σ_r(I_{1,i}(Â)) C_1.   (219)

Thus, the conditions of Lemma 16 are satisfied. According to Lemma 16 with A replaced by Â, and according to (212) and (215), we have ‖Θ̃ − Θ^*‖_{2→∞} ≤ C_1 and

‖Θ̃ − Θ^*‖_{2→∞} ≤ max_{i∈[n]} [ (σ_r(I_{1,i}(Â)))^{-1} {‖Z_{i·} diag(Ω_{i·}) Â‖ + ‖B_{1,i}(Â)‖ + β_{1,i}(Â) κ_3^*} ]   (220)
≲ (δ_2^* π)^{-1} κ_2^* {r log^{1/2}(n) π^{1/2} + log(np) r^{(1+η_1)∨0} p^{1/2} π^{1/2} e_{A,F}}
= κ_2^* (δ_2^*)^{-1} π^{−1/2} {r (log(n))^{1/2} + log(np) r^{(1+η_1)∨0} p^{1/2} e_{A,F}}

with probability converging to 1. Moreover, by (215), the objective of the optimization problem max_{θ_i∈R^r} Σ_{j∈[p]} ω_{ij} {y_{ij} θ_i^T â_j − b(θ_i^T â_j)} is strictly concave. Thus, θ̃_i is the unique solution to this optimization problem.
5. nπ ≫ (κ_2^*)^2 (δ_2^*)^{-4} (log(np))^2 max{r^{(1+2η_1−2η_2)∨(1+2η_1−4η_2)}, (κ_3^*)^2 r^{5+8η_1−8η_2}};
6. (np)^{−1/2} e_{M,F} ≪ (κ_2^*)^{-2} (δ_2^*)^3 (log(np))^{-2} π^{1/2} · min[r^{(1/2+2η_2)∧(−3/2−2η_1+3η_2)∧(−1/2−η_1+3η_2)∧(1/2+3η_2)}, (κ_3^*)^{-1} r^{(−7/2−5η_1+5η_2)∧(−5/2−4η_1+5η_2)∧(−3/2−3η_1+5η_2)}].   (221)
Then, with probability converging to 1, the estimating equations in steps 3 and 4 of Algorithm 1 have a unique solution, and

‖M̃ − M^*‖_max ≲ (δ_2^*)^{-2} (κ_2^*)^2 (log(np))^2 [ r^{(5/2+2η_1−2η_2)∨(3/2+η_1−2η_2)} {(n∧p)π}^{−1/2} + r^{(5/2+3η_1−3η_2)∨(3/2+2η_1−3η_2)∨(1/2+η_1−3η_2)} (npπ)^{−1/2} e_{M,F} ].   (222)
Proof First, we analyze the asymptotic regime assumption. The 4-th condition of the asymptotic regime, i.e.,

pπ ≫ (κ_2^*)^4 (δ_2^*)^{-6} (log(np))^3 · max[r^{(1+2η_1)∨(3+2η_1−4η_2)∨(1−4η_2)}, (κ_3^*)^2 r^{{7+8(η_1−η_2)}∨(5+6η_1−8η_2)}],   (223)

implies

pπ ≫ max{ (δ_2^*)^{-4} (κ_2^*)^2 log^2(n) max{r^{1∨(1−2η_2)}, (κ_3^*)^2 r^5}, (δ_2^*)^{-6} (κ_2^*)^4 (log(np))^3 r^{(3+2η_1−4η_2)∨(1−4η_2)}, (δ_2^*)^{-6} (κ_2^*)^4 (κ_3^*)^2 (log(np))^3 r^{{7+8(η_1−η_2)}∨(5+6η_1−8η_2)} },   (224)

where we used the facts that 7 + 8(η_1 − η_2) ≥ 7 > 5, (1+2η_1) ∨ (2−2η_2) ≥ 1, and 2 − 2η_2 < 3 + 2η_1 − 4η_2.

The 6-th condition of the asymptotic regime, i.e.,

(np)^{−1/2} e_{M,F} ≪ (κ_2^*)^{-2} (δ_2^*)^3 (log(np))^{-2} π^{1/2} · min[r^{(1/2+2η_2)∧(−3/2−2η_1+3η_2)∧(−1/2−η_1+3η_2)∧(1/2+3η_2)}, (κ_3^*)^{-1} r^{(−7/2−5η_1+5η_2)∧(−5/2−4η_1+5η_2)∧(−3/2−3η_1+5η_2)}],   (225)

implies

(np)^{−1/2} e_{M,F} ≪ min{ r^{η_2} (κ_2^*)^{-1} (δ_2^*)^2 (log(np))^{-1} r^{0∧(−1/2−η_1+η_2)∧(1/2+η_2)} π^{1/2}, r^{η_2} (κ_2^*)^{-1} (δ_2^*)^2 (log(np))^{-1} (κ_3^*)^{-1} r^{(−5/2−η_1)∧(−3/2)} π^{1/2}, (δ_2^*)^3 (κ_2^*)^{-2} (log(np))^{-2} r^{(−3/2−2η_1+3η_2)∧(−1/2−η_1+3η_2)∧(1/2+3η_2)} π^{1/2}, (δ_2^*)^3 (κ_2^*)^{-2} (κ_3^*)^{-1} (log(np))^{-2} r^{(−7/2−5η_1+5η_2)∧(−5/2−4η_1+5η_2)∧(−3/2−3η_1+5η_2)} π^{1/2} },   (226)

where we used the facts that η_2 ≥ −1/2 − η_1 + 2η_2, −1/2 − η_1 + 2η_2 > −3/2 − 2η_1 + 3η_2, −5/2 − η_1 + η_2 > −7/2 − 5η_1 + 5η_2, and −3/2 + η_2 > −5/2 − 4η_1 + 5η_2.

According to (226),

e_{M,F} ≪ (np)^{1/2} r^{η_2} (κ_2^*)^{-1} (δ_2^*)^2 (log(np))^{-1} min{r^{0∧(−1/2−η_1+η_2)∧(1/2+η_2)}, (κ_3^*)^{-1} r^{(−5/2−η_1)∧(−3/2)}} π^{1/2},   (227)
which implies

e_{A,F} ≪ (κ_2^*)^{-1} (δ_2^*)^2 (log(np))^{-1} min{r^{0∧(−1/2−η_1+η_2)∧(1/2+η_2)}, (κ_3^*)^{-1} r^{(−5/2−η_1)∧(−3/2)}} π^{1/2}.   (228)

Also, according to the lemma's assumption, pπ ≫ (δ_2^*)^{-4} (κ_2^*)^2 log^2(n) max{r^{1∨(1−2η_2)}, (κ_3^*)^2 r^5}. Thus, the conditions of Lemma 48 are satisfied. According to Lemma 48, ‖Θ̃ − Θ^*‖_{2→∞} ≤ e_{Θ,2→∞} with probability converging to 1, for e_{Θ,2→∞} satisfying

e_{Θ,2→∞} ∼ κ_2^* (δ_2^*)^{-1} π^{−1/2} {r (log(n))^{1/2} + log(np) r^{(1+η_1)∨0} p^{1/2} e_{A,F}}   (229)
≲ κ_2^* (δ_2^*)^{-1} π^{−1/2} {r (log(n))^{1/2} + log(np) r^{(1+η_1)∨0} p^{1/2} · r^{−η_2} (np)^{−1/2} e_{M,F}}
∼ κ_2^* (δ_2^*)^{-1} π^{−1/2} {r (log(n))^{1/2} + log(np) r^{(1+η_1−η_2)∨(−η_2)} n^{−1/2} e_{M,F}}.
Note that the proof of Lemma 39 does not require the independence between Θ̃_{N_2} and the missing pattern Ω. Thus, following similar arguments, Lemma 39 still applies with Θ̃_{N_2} replaced by Θ̃ and N_2 replaced by [n]. Next, we verify that the asymptotic regime of Lemma 39 is satisfied.

According to (224), pπ ≫ (δ_2^*)^{-6} (κ_2^*)^4 (log(np))^3 r^{(3+2η_1−4η_2)∨(1−4η_2)}, which implies

κ_2^* (δ_2^*)^{-1} π^{−1/2} r (log(n))^{1/2} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} r^{(−1/2−η_1+2η_2)∧(1/2+2η_2)}.   (230)

According to (224), pπ ≫ (δ_2^*)^{-6} (κ_2^*)^4 (κ_3^*)^2 (log(np))^3 r^{{7+8(η_1−η_2)}∨(5+6η_1−8η_2)}, which implies

κ_2^* (δ_2^*)^{-1} π^{−1/2} r (log(n))^{1/2} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} (κ_3^*)^{-1} r^{(−5/2−4η_1+4η_2)∧(−3/2−3η_1+4η_2)}.   (231)

According to (226), (np)^{−1/2} e_{M,F} ≪ (δ_2^*)^3 (κ_2^*)^{-2} (log(np))^{-2} r^{(−3/2−2η_1+3η_2)∧(−1/2−η_1+3η_2)∧(1/2+3η_2)} π^{1/2}, which implies

κ_2^* (δ_2^*)^{-1} π^{−1/2} log(np) r^{(1+η_1−η_2)∨(−η_2)} n^{−1/2} e_{M,F} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} r^{(−1/2−η_1+2η_2)∧(1/2+2η_2)}.   (232)

According to (226), (np)^{−1/2} e_{M,F} ≪ (δ_2^*)^3 (κ_2^*)^{-2} (κ_3^*)^{-1} (log(np))^{-2} r^{(−7/2−5η_1+5η_2)∧(−5/2−4η_1+5η_2)∧(−3/2−3η_1+5η_2)} π^{1/2}, which implies

κ_2^* (δ_2^*)^{-1} π^{−1/2} log(np) r^{(1+η_1−η_2)∨(−η_2)} n^{−1/2} e_{M,F} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} (κ_3^*)^{-1} r^{(−5/2−4η_1+4η_2)∧(−3/2−3η_1+4η_2)}.   (233)

Combining (230)–(233) with (229), we obtain

e_{Θ,2→∞} ≪ (δ_2^*)^2 (κ_2^*)^{-1} p^{1/2} (log(np))^{-1} · min{r^{(−1/2−η_1+2η_2)∧(1/2+2η_2)}, (κ_3^*)^{-1} r^{(−5/2−4η_1+4η_2)∧(−3/2−3η_1+4η_2)}},   (234)
which implies that e_{Θ,2→∞} satisfies the 5-th condition of the asymptotic regime of Lemma 39. On the other hand, according to the lemma's assumption,

nπ_min ≫ (κ_2^*)^2 (δ_2^*)^{-4} (log(np))^2 max{(π_max/π_min) r^{(1+2η_1−2η_2)∨(1+2η_1−4η_2)}, (κ_3^*)^2 (π_max/π_min)^3 r^{5+8η_1−8η_2}}.   (235)

Thus, the other requirements for the asymptotic regime in Lemma 39 are also satisfied.

According to Lemma 39, we have ‖Ã − A^*‖_{2→∞} ≤ e_{A,2→∞} with probability converging to 1, where

e_{A,2→∞} ∼ κ_2^* (δ_2^*)^{-1} r^{−2η_2} log(np) {p^{−1/2} r^{1+η_1} (nπ)^{−1/2} + r^{(1+η_1)∨0} p^{−1/2} e_{Θ,2→∞}}.   (236)

Plugging (229) into (236), we obtain

e_{A,2→∞}
≲ κ_2^* (δ_2^*)^{-1} r^{−2η_2} log(np) p^{−1/2} [ r^{1+η_1} (nπ)^{−1/2} + r^{(1+η_1)∨0} · κ_2^* (δ_2^*)^{-1} π^{−1/2} {r (log(n))^{1/2} + log(np) r^{(1+η_1−η_2)∨(−η_2)} n^{−1/2} e_{M,F}} ]
≲ (δ_2^*)^{-2} (κ_2^*)^2 (log(np))^2 p^{−1/2} [ r^{(2+η_1−2η_2)∨(1−2η_2)} {(n∧p)π}^{−1/2} + r^{(2+2η_1−3η_2)∨(1+η_1−3η_2)∨(−3η_2)} (npπ)^{−1/2} e_{M,F} ].   (237)
Next, we derive an asymptotic upper bound for ‖M̃ − M^*‖_max. Recall that M̃ = Θ̃ Ã^T. Thus, for P̂ ∈ O^{r×r} defined in (183), and Θ^* = U_r^* D_r^* P̂, A^* = V_r^* P̂, we have M̃ − M^* = Θ̃ Ã^T − Θ^* (A^*)^T = (Θ̃ − Θ^*)(A^*)^T + Θ̃ (Ã − A^*)^T. Thus,

‖M̃ − M^*‖_max ≤ ‖Θ̃ − Θ^*‖_{2→∞} ‖A^*‖_{2→∞} + ‖Ã − A^*‖_{2→∞} ‖Θ̃‖_{2→∞}.   (238)

According to Lemma 48 and the assumption ‖A^*‖_{2→∞} ≤ C_2 ≲ (r/p)^{1/2}, with probability converging to 1, the above display is further bounded by (239). Combining the above inequality with (229) and (237), we obtain, with probability tending to 1,

‖M̃ − M^*‖_max
≲ r^{1/2} p^{−1/2} · κ_2^* (δ_2^*)^{-1} π^{−1/2} {r (log(n))^{1/2} + log(np) r^{(1+η_1−η_2)∨(−η_2)} n^{−1/2} e_{M,F}}
+ r^{1/2+η_1} p^{1/2} · (δ_2^*)^{-2} (κ_2^*)^2 (log(np))^2 p^{−1/2} [ r^{(2+η_1−2η_2)∨(1−2η_2)} {(n∧p)π}^{−1/2} + r^{(2+2η_1−3η_2)∨(1+η_1−3η_2)∨(−3η_2)} (npπ)^{−1/2} e_{M,F} ]
≲ (δ_2^*)^{-2} (κ_2^*)^2 (log(np))^2 [ r^{(5/2+2η_1−2η_2)∨(3/2+η_1−2η_2)} {(n∧p)π}^{−1/2} + r^{(3/2+η_1−η_2)∨(1/2−η_2)∨(5/2+3η_1−3η_2)∨(3/2+2η_1−3η_2)∨(1/2+η_1−3η_2)} (npπ)^{−1/2} e_{M,F} ]
≲ (δ_2^*)^{-2} (κ_2^*)^2 (log(np))^2 [ r^{(5/2+2η_1−2η_2)∨(3/2+η_1−2η_2)} {(n∧p)π}^{−1/2} + r^{(5/2+3η_1−3η_2)∨(3/2+2η_1−3η_2)∨(1/2+η_1−3η_2)} (npπ)^{−1/2} e_{M,F} ],   (240)

where we used the facts that 3/2 + η_1 − η_2 < 5/2 + 3η_1 − 3η_2 and 1/2 − η_2 < 3/2 + 2η_1 − 3η_2 in the last inequality. This completes the proof.
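As a quick sanity check of the two exponent comparisons used in the last step (both rely on the standing assumption η_1 ≥ η_2):
\[
(5/2 + 3\eta_1 - 3\eta_2) - (3/2 + \eta_1 - \eta_2) = 1 + 2(\eta_1 - \eta_2) \ge 1 > 0,
\qquad
(3/2 + 2\eta_1 - 3\eta_2) - (1/2 - \eta_2) = 1 + 2(\eta_1 - \eta_2) \ge 1 > 0.
\]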
and is implied by R7: (np)^{−1/2} e_{M,F} ≪ (κ_2^*)^{-2} (δ_2^*)^3 (log(np))^{-2} π^{1/2} min[r^{−5/2}, (κ_3^*)^{-1} r^{−7/2}] for η ≥ −1.
Thus, under R1–R7, the conditions of Lemma 49 are satisfied, and thus, with probability converging to 1,

‖M̃ − M^*‖_max
≲ (δ_2^*)^{-2} (κ_2^*)^2 (log(np))^2 [ r^{(5/2+2η_1−2η_2)∨(3/2+η_1−2η_2)} {(n∧p)π}^{−1/2} + r^{(5/2+3η_1−3η_2)∨(3/2+2η_1−3η_2)∨(1/2+η_1−3η_2)} (npπ)^{−1/2} e_{M,F} ]
≲ (δ_2^*)^{-2} (κ_2^*)^2 (log(np))^2 [ r^{5/2∨(3/2−η)} {(n∧p)π}^{−1/2} + r^{(5/2)∨(3/2−η)∨(1/2−2η_2)} (npπ)^{−1/2} e_{M,F} ]
≲ (δ_2^*)^{-2} (κ_2^*)^2 (log(np))^2 [ r^{5/2} {(n∧p)π}^{−1/2} + r^{5/2} (npπ)^{−1/2} e_{M,F} ].   (244)

The above analysis gives the error bound for M̃. The proof of the 'in particular' part of the theorem is similar to that of Theorem 10, and we skip the repetitive details.
Proof [Proof of Corollary 8] For the binomial model, b''(x) = k e^x (1+e^x)^{−2} and b^{(3)}(x) = k e^x (1+e^x)^{−2} {1 − 2(1+e^{−x})^{−1}}. Thus, κ_2(α) ≤ k, κ_3(α) ≤ k, and δ_2(α) ≥ k e^α (1+e^α)^{−2} ≳ k e^{−α}. This implies that κ_2^*, κ_3^* ≲ 1 under the asymptotic regime that k ∼ 1 (R8B). Also, δ_2^* ≳ k e^{−2(ρ+1)} ≳ e^{−2ρ} ≳ k e^{−2(log(n∧p))^{1−ε_0}} ≫ (n∧p)^{−ε_1} for any constant ε_1 > 0, where the third inequality is due to R9B. Combining the analysis above, we have (κ_2^*)^4 (δ_2^*)^{-6} (log(np))^3 ≪ (n∨p)^{6ε_1} (log(np))^3 ≪ (n∨p)^{7ε_1}. Similarly, (κ_2^*)^2 (δ_2^*)^{-4} (log(np))^2 ≪ (n∨p)^{5ε_1}, and (κ_2^*)^{-2} (δ_2^*)^3 (log(np))^{-2} ≫ (n∧p)^{−4ε_1}. Combining the above analysis with R5B–R7B, and noting that (1+2η) ∨ 5 ≤ (3+4η) ∨ 7 for η ≥ −1, we verify that R5–R7 hold with 7ε_1 < ε_0.
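For completeness, the binomial derivative formulas used above follow from the log-partition function b(x) = k log(1 + e^x), the standard natural-parameter form for a Binomial(k) response, consistent with the b'' stated above:
\[
b'(x) = \frac{k e^x}{1+e^x}, \qquad b''(x) = \frac{k e^x}{(1+e^x)^2}, \qquad b^{(3)}(x) = \frac{k e^x}{(1+e^x)^2}\Big\{1 - \frac{2}{1+e^{-x}}\Big\},
\]
so |b''(x)| and |b^{(3)}(x)| are bounded by k, giving κ_2(α) ≤ k and κ_3(α) ≤ k.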
For the normal model, b''(x) = 1 and b^{(3)}(x) = 0 for all x. Thus, κ_2^* = δ_2^* = 1 and κ_3^* = 0. Part 2 of Corollary 8 then follows by simplifying Theorem 5.

In the rest of the analysis, we focus on the Poisson model. Note that ‖M^*‖_max ≤ C_1 C_2, so we can choose ρ ≤ C_1 C_2 ≲ r^{1+η}. Under R10P, r^{1+η} ≲ (log(n∧p))^{1−ε_0}, so max(ρ, C_1 C_2) ≲ (log(n∧p))^{1−ε_0}.

For the Poisson model, b(x) = e^x, so b''(x) = b^{(3)}(x) = e^x. Thus, κ_2(α), κ_3(α) ≤ e^α and δ_2(α) ≥ e^{−α}. This implies κ_2^* ≤ e^{2ρ+1} ≲ e^{2ρ} ≲ e^{2(log(n∧p))^{1−ε_0}} ≲ (n∧p)^{ε_1} for any constant ε_1 > 0. Similarly, δ_2^* ≳ e^{−2ρ} ≳ (n∧p)^{−ε_1} and κ_3^* ≲ e^{6C_1C_2} ≲ (n∨p)^{ε_1} for any constant ε_1 > 0. The proof then follows similarly to that for the normal model.
Proof [Proof of Corollary 12] The proof of Corollary 12 is similar to that of Corollary 8, except
that R7B is replaced by R7’B to ensure R7’ holds. We omit the repetitive details.
Table 5: Simulation settings. 'Variable type = O' indicates that all variables are ordinal (with k_j = 5), and 'Variable type = O + C' indicates that half of the variables are ordinal (with k_j = 5) and half are continuous. For continuous and ordinal variables, we assume the normal and binomial models, respectively.
is given by M^* = Θ^* (A^*)^T. The missing indicators ω_{ij} are generated independently from a Bernoulli distribution with parameter π, where π = 0.6 and π = 0.2 are considered in the simulation settings. When ω_{ij} = 1 and variable j is ordinal, Y_{ij} is generated from a binomial distribution with k_j = 5 trials and success probability exp(m^*_{ij})/(1 + exp(m^*_{ij})). When ω_{ij} = 1 and variable j is continuous, Y_{ij} is generated from a normal distribution N(m^*_{ij}, 1). In the implementation, we set C_2 = 2√(r/p) in Algorithms 1, 2, and 3. We set ρ_0 = r in the NBE and C = √r in the CJMLE.
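To make the data-generating process above concrete, here is a minimal simulation sketch. The sizes and the Gaussian draws for Θ^* and A^* are hypothetical placeholders (the actual designs are those of Table 5); only the Bernoulli(π) missingness, the binomial model for ordinal variables, and the N(m^*_{ij}, 1) model for continuous variables follow the text:

import numpy as np

rng = np.random.default_rng(0)
n, p, r, pi, k = 2000, 400, 3, 0.6, 5   # illustrative sizes only; see Table 5

# Hypothetical placeholder draws for Theta* and A*.
Theta_star = rng.normal(size=(n, r))
A_star = rng.normal(size=(p, r))
M_star = Theta_star @ A_star.T                     # M* = Theta* (A*)^T

Omega = rng.binomial(1, pi, size=(n, p))           # omega_ij ~ Bernoulli(pi)
is_ordinal = np.arange(p) < p // 2                 # 'O + C': half ordinal, half continuous

prob = 1.0 / (1.0 + np.exp(-M_star))               # exp(m*_ij) / (1 + exp(m*_ij))
Y_ordinal = rng.binomial(k, prob)                  # Binomial(k_j = 5, prob) for ordinal j
Y_continuous = rng.normal(M_star, 1.0)             # N(m*_ij, 1) for continuous j
Y = np.where(is_ordinal[None, :], Y_ordinal, Y_continuous)
Y = np.where(Omega == 1, Y, np.nan)                # entries with omega_ij = 0 are missing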
[Figure 3: three columns of panels; the top row plots the scaled Frobenius norm and the bottom row the max norm, each against values 1–8 on the horizontal axis.]
Figure 3: Results from Simulation Settings 7–9. The plots can be interpreted similarly as those in Figure 1.
[Figure 4: three columns of panels; the top row plots the scaled Frobenius norm and the bottom row the max norm, each against values 1–8 on the horizontal axis.]
Figure 4: Results from Simulation Settings 10–12. The plots can be interpreted similarly as those in Figure 1.
[Figure 5: three columns of panels; the top row plots the scaled Frobenius norm and the bottom row the max norm, each against values 1–8 on the horizontal axis.]
Figure 5: Results from Simulation Settings 13–15. The plots can be interpreted similarly as those in Figure 1.
[Figure 6: three columns of panels; the top row plots the scaled Frobenius norm and the bottom row the max norm, each against values 1–8 on the horizontal axis.]
Figure 6: Results from Simulation Settings 16–18. The plots can be interpreted similarly as those in Figure 1.
[Figure 7: three columns of panels; the top row plots the scaled Frobenius norm and the bottom row the max norm, each against values 1–8 on the horizontal axis.]
Figure 7: Results from Simulation Settings 19–21. The plots can be interpreted similarly as those in Figure 1.
[Figure 8: three columns of panels; the top row plots the scaled Frobenius norm and the bottom row the max norm, each against values 1–8 on the horizontal axis.]
Figure 8: Results from Simulation Settings 22–24. The plots can be interpreted similarly as those in Figure 1.
References
Emmanuel Abbe, Jianqing Fan, Kaizheng Wang, and Yiqiao Zhong. Entrywise eigenvector analysis
of random matrices with low expected rank. Annals of Statistics, 48(3):1452, 2020.
Mokhtar Z Alaya and Olga Klopp. Collective matrix completion. Journal of Machine Learning
Research, 20:1–43, 2019.
David J Bartholomew, Fiona Steele, Irini Moustaki, and Jane I Galbraith. Analysis of multivariate
social science data. CRC Press, Boca Raton, FL, 2008.
Yoav Bergner, Peter Halpin, and Jill-Jênn Vie. Multidimensional item response theory in the style
of collaborative filtering. Psychometrika, 87(1):266–288, 2022.
Sonia A Bhaskar. Probabilistic low-rank matrix completion from quantized measurements. The
Journal of Machine Learning Research, 17(1):2131–2164, 2016.
Pratik Biswas, Tzu-Chen Lian, Ta-Chung Wang, and Yinyu Ye. Semidefinite programming based
algorithms for sensor network localization. ACM Transactions on Sensor Networks (TOSN), 2
(2):188–220, 2006.
Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymp-
totic theory of independence. Oxford University Press, Oxford, England, 2013.
Tony Cai and Wen-Xin Zhou. A max-norm constrained minimization approach to 1-bit matrix
completion. Journal of Machine Learning Research, 14(1):3619–3647, 2013.
Tony Cai and Wen-Xin Zhou. Matrix completion via max-norm constrained optimization. Elec-
tronic Journal of Statistics, 10(1):1493–1525, 2016.
Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foun-
dations of Computational Mathematics, 9(6):717–772, 2009.
Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix com-
pletion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
Yang Cao and Yao Xie. Poisson matrix recovery and completion. IEEE Transactions on Signal
Processing, 64(6):1609–1620, 2015.
Joshua Cape, Minh Tang, and Carey E Priebe. The two-to-infinity norm and singular subspace
geometry with applications to high-dimensional statistics. Annals of Statistics, 47(5):2405–2439,
2019.
Sourav Chatterjee. Matrix estimation by universal singular value thresholding. The Annals of
Statistics, 43(1):177–214, 2015.
Yunxiao Chen and Xiaoou Li. Determining the number of factors in high-dimensional generalized
latent factor models. Biometrika, 109(3):769–782, 2022.
Yunxiao Chen, Xiaoou Li, and Siliang Zhang. Joint maximum likelihood estimation for high-
dimensional exploratory item factor analysis. Psychometrika, 84(1):124–146, 2019a.
Yunxiao Chen, Xiaoou Li, and Siliang Zhang. Structured latent factor analysis for large-scale data:
Identifiability, estimability, and their implications. Journal of the American Statistical Associa-
tion, 115(532):1756–1770, 2020a.
Yunxiao Chen, Xiaoou Li, and Siliang Zhang. Structured latent factor analysis for large-scale data:
Identifiability, estimability, and their implications. Journal of the American Statistical Associa-
tion, 115:1756–1770, 2020b.
Yunxiao Chen, Chengcheng Li, Jing Ouyang, and Gongjun Xu. Statistical inference for noisy
incomplete binary matrix. Journal of Machine Learning Research, 24(95):1–66, 2023.
Yuxin Chen, Jianqing Fan, Cong Ma, and Yuling Yan. Inference and uncertainty quantification
for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46):22931–
22937, 2019b.
Yuxin Chen, Yuejie Chi, Jianqing Fan, Cong Ma, and Yuling Yan. Noisy matrix completion: Under-
standing statistical guarantees for convex relaxation via nonconvex optimization. SIAM Journal
on Optimization, 30(4):3098–3121, 2020c.
Victor Chernozhukov, Christian Hansen, Yuan Liao, and Yinchu Zhu. Inference for low-rank mod-
els. The Annals of Statistics, 51(3):1309–1330, 2023.
Mark A Davenport, Yaniv Plan, Ewout Van Den Berg, and Mary Wootters. 1-bit matrix completion.
Information and Inference: A Journal of the IMA, 3:189–223, 2014.
Andrey Feuerverger, Yu He, and Shashi Khatri. Statistical significance of the Netflix challenge.
Statistical Science, 27:202–231, 2012.
David Goldberg, David Nichols, Brian M Oki, and Douglas Terry. Using collaborative filtering to
weave an information tapestry. Communications of the ACM, 35(12):61–70, 1992.
Shelby J Haberman. When can subscores have value? Journal of Educational and Behavioral
Statistics, 33(2):204–229, 2008.
Ruijian Han, Rougang Ye, Chunxi Tan, and Kani Chen. Asymptotic theory of sparse Bradley–Terry
model. Annals of Applied Probability, 30:2491–2515, 2020.
Ruijian Han, Yiming Xu, and Kani Chen. A general pairwise comparison model for extremely
sparse networks. Journal of the American Statistical Association, 118(544):2422–2432, 2023.
F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM
Transactions on Interactive Intelligent Systems (TIIS), 5(4):1–19, 2015.
Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alter-
nating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of
computing, pages 665–674, 2013.
Anura P Jayasumana, Randy Paffenroth, Gunjan Mahindre, Sridhar Ramasamy, and Kelum Ga-
jamannage. Network topology mapping from partial virtual coordinates and graph geodesics.
IEEE/ACM Transactions on Networking, 27(6):2405–2417, 2019.
Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy
entries. Journal of Machine Learning Research, 11:2057–2078, 2010.
Olga Klopp. Noisy low-rank matrix completion with general sampling distribution. Bernoulli, 20
(1):282–303, 2014.
Olga Klopp, Jean Lafond, Eric Moulines, and Joseph Salmon. Adaptive multinomial matrix com-
pletion. Electronic Journal of Statistics, 9(2):2950–2975, 2015.
Vladimir Koltchinskii, Karim Lounici, and Alexandre B Tsybakov. Nuclear-norm penalization and
optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329,
2011.
Geofferey N Masters and Benjamin D Wright. The essential process in a family of measurement
models. Psychometrika, 49(4):529–544, 1984.
Andrew D McRae and Mark A Davenport. Low-rank matrix completion and denoising under Pois-
son noise. Information and Inference: A Journal of the IMA, 10(2):697–720, 2021.
Sahand Negahban and Martin J Wainwright. Restricted strong convexity and weighted matrix com-
pletion: Optimal bounds with noise. Journal of Machine Learning Research, 13(1):1665–1697,
2012.
OECD. PISA 2018 assessment and analytical framework. OECD Publishing, Paris, France, 2019a.
OECD. PISA 2018 technical report. OECD Publishing, Paris, France, 2019b.
James M Ortega and Werner C Rheinboldt. Iterative solution of nonlinear equations in several
variables. SIAM, Philadelphia, PA, 2000.
Mark Reckase. Multidimensional item response theory. Springer, New York, NY, 2009.
Geneviève Robin, Julie Josse, Éric Moulines, and Sylvain Sardy. Low-rank model with covariates
for count data with missing values. Journal of Multivariate Analysis, 173:416–434, 2019.
Geneviève Robin, Olga Klopp, Julie Josse, Éric Moulines, and Robert Tibshirani. Main effects and
interactions in mixed and incomplete data frames. Journal of the American Statistical Associa-
tion, 115(531):1292–1303, 2020.
Anders Skrondal and Sophia Rabe-Hesketh. Generalized latent variable modeling: Multilevel,
longitudinal, and structural equation models. CRC Press, Boca Raton, FL, 2004.
Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
Michel Wedel, Ulf Böckenholt, and Wagner A Kamakura. Factor models for multivariate count
data. Journal of Multivariate Analysis, 87(2):356–369, 2003.
Per-Åke Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numer-
ical Mathematics, 12:99–111, 1972.
Dong Xia and Ming Yuan. Statistical inferences of linear forms for noisy matrix completion. Journal
of the Royal Statistical Society: Series B (Statistical Methodology), 83(1):58–77, 2021.
Haoran Zhang, Yunxiao Chen, and Xiaoou Li. A note on exploratory item factor analysis by singular
value decomposition. Psychometrika, 85(2):358–372, 2020.