We propose an incomplete-data, quasi-likelihood framework for estimation and score tests that accommodates both dependent and partially observed data. The motivation comes from genetic association studies, where we address the problems of estimating haplotype frequencies and testing association between a disease and haplotypes of multiple, tightly linked genetic markers, using case-control samples containing related individuals.
More generally, we consider a setting in which the complete data are dependent, with marginal distributions following a generalized linear model. We form a vector, Z, whose elements are conditional expectations of the elements of the complete-data vector, given selected functions of the incomplete data. Assuming that the covariance matrix of Z is available, we construct an optimal linear estimating function based on Z and solve it by an iterative method.
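To make the idea of solving a linear estimating function iteratively more concrete, here is a minimal sketch, not the authors' implementation. It assumes a logit link, a design matrix `X`, and a covariance matrix `sigma` that is known and held fixed. The function name `solve_quasi_score` and all of its arguments are hypothetical. The vector `z` plays the role of Z, and the equation solved is the standard quasi-score form D' Sigma^{-1} (z - mu(beta)) = 0, with D the derivative of the mean with respect to beta. In the incomplete-data setting described above, Z and its covariance would generally be recomputed as the parameter estimate changes, which this sketch does not do.

```python
import numpy as np

def solve_quasi_score(z, X, sigma, n_iter=50, tol=1e-10):
    """Solve D' Sigma^{-1} (z - mu(beta)) = 0 by Fisher scoring.

    Assumptions (illustrative only): logit link, mu = expit(X @ beta),
    and a covariance matrix sigma that is known and does not depend on beta.
    """
    beta = np.zeros(X.shape[1])
    sigma_inv = np.linalg.inv(sigma)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))   # mean under the logit link
        D = (mu * (1.0 - mu))[:, None] * X       # d mu / d beta
        score = D.T @ sigma_inv @ (z - mu)       # linear estimating function
        info = D.T @ sigma_inv @ D               # its expected negative derivative
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy usage: four exchangeably correlated observations, intercept-only model.
z = np.array([0.1, 0.2, 0.3, 0.4])
X = np.ones((4, 1))
sigma = 0.7 * np.eye(4) + 0.3 * np.ones((4, 4))
beta_hat = solve_quasi_score(z, X, sigma)
```

In this toy case the exchangeable covariance weights all observations equally, so the fitted mean equals the sample mean of `z` and `beta_hat[0]` is its logit. With a less symmetric covariance, such as one induced by a pedigree, the weights would differ across observations.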
This approach addresses key difficulties in haplotype frequency estimation and testing problems in related individuals: (a) dependence that is known but can be complicated; (b) data that are incomplete for structural reasons, as well as possibly missing, with different amounts of information for different observations; (c) the need for computational speed to analyze large numbers of markers; and (d) a well-established null model but an alternative model that is unknown and is difficult to specify fully in related individuals.
For haplotype analysis, we give sufficient conditions for consistency and asymptotic normality of the estimator and for the asymptotic χ² null distribution of the score test.
We apply the method to test for association of haplotypes with alcoholism in the GAW 14 COGA data set.
Request Reprint E-Mail: zwang@galton.uchicago.edu