Integration of Nonprobability and Probability Samples via Survey Weights

Probability sampling faces the twin problems of rising cost and nonresponse. The cost of executing a large probability sample survey has been increasing rapidly and, for some surveys, the response rate can fall below 10 percent. Statisticians have therefore sought alternative methods. One of them is to use a large nonprobability sample (S_1) supplemented by a small probability sample (S_2). Both samples are taken from the same population and share common covariates, and a third sample (S_3) is created by combining the two; S_1 can be biased and S_2 may have large sampling variance. Both problems are reduced by survey weights and by combining the two samples. Although S_2 is small, it contributes unbiased estimation and known survey weights. With these known weights, we obtain adjusted sample weights (ASWs) and create a sample model from a finite population model. We fit the sample model to estimate its parameters and generate values from the population model. We repeat this process for the other two samples, S_1 and S_3, and for different statistical methods. We show that the biases of the finite population mean estimates are reduced and that the variances decrease as the combined sample size becomes large. We analyze sample data to illustrate the reduction of both errors.


Introduction
Probability sampling has been the main tool for sample surveys since the 1900s. It provides unbiased and consistent estimates. Public and private survey organizations such as Gallup, the US Census Bureau, the National Center for Health Statistics (NCHS), the National Institutes of Health, and the Bureau of Labor Statistics have been using probability samples to produce official statistics of the US population, such as public opinion, personal income, health status, economic indicators, unemployment rates, and other essential official statistics for the United States Government and other researchers.
However, more survey organizations are abandoning probability sampling because of high cost and increasing nonresponse (Beaumont, 2020). Nonprobability sampling is emerging as an attractive alternative to probability sampling because of its low cost and convenience. But we must pay for the low cost and convenience, because there can be large bias.
A small probability sample can be used to reduce the bias of a nonprobability sample. Both samples, taken from the same population, share common covariates. Relative to the probability sample alone, the combined sample has increased bias but decreased variance; relative to the nonprobability sample, the reverse is true (bias decreased, variance increased). The small probability sample contributes valuable information: known sample weights and an unbiased mean. This mean is used to assess the bias of the nonprobability and combined samples. The known weights are also used to impute the unknown weights of S_1 (Appendix A). Others often use the covariates in a logistic function to obtain sample weights under certain assumptions (Chen et al. 2020). Instead, we use covariate matching, which does not require such unreasonable assumptions. These known (or imputed) survey weights, the original sample weights (OSWs), are then used to obtain ASWs. We use the ASWs to reduce bias and combine the two samples to reduce variance in estimation.
For the National Health and Nutrition Examination Survey, NCHS collects a probability sample to investigate the health status of the US population. As an example, we use part of this NCHS sample to derive three samples: a rather large sample S_1 obtained by dropping the survey weights, a small sample S_2, and a third sample S_3 created by combining the two. We also use different methods for estimating the finite population mean: non-Bayesian (least squares and bootstrap) and Bayesian. Within the Bayesian approach, we provide a simple model that has closed-form answers when the non-sample covariates are known. Within the non-Bayesian approach, we use least squares (LS) and the bootstrap; the bootstrap is used to calculate the variance of the LS estimator, mainly because the LS variance is far too small when the OSWs are large. The Bayesian and LS methods are applied separately to the three samples S_1, S_2, and S_3. Whichever method we use, the results are similar: bias is reduced by applying ASWs, and variance is reduced by combining the two samples.

Meng (2018) decomposed the error of a sample, the difference between the sample mean ȳ_n and the unknown true population mean Ȳ_N, into the product of three terms. The first is data quality, the correlation ρ_{R,Y} between the response indicator R_i (1 if sampled, 0 otherwise) and the study variable Y; under probability sampling, data quality is controlled at the level of 1/sqrt(N), where N is the population size. The second is data quantity, sqrt((1-f)/f), where f = n/N is the sampling fraction. When that control is lost, the impact of N is no longer cancelled, and the error grows with sqrt(N) rather than shrinking as 1/sqrt(n); a bigger N makes matters worse for a nonprobability sample. He calls this the large sample (big data) paradox: the bigness of big data for population inference should be measured by the relative size f = n/N, not the absolute size n. The third term is problem difficulty, σ_Y, the standard deviation of the study variable y.
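In our notation, Meng's identity for the error of a sample mean can be written compactly as:

```latex
\bar{y}_n-\bar{Y}_N
=\underbrace{\rho_{R,Y}}_{\text{data quality}}
\times\underbrace{\sqrt{\tfrac{1-f}{f}}}_{\text{data quantity}}
\times\underbrace{\sigma_Y}_{\text{problem difficulty}},
\qquad f=\frac{n}{N}.
```

Under probability sampling, E(ρ_{R,Y}) is of order 1/sqrt(N), so the sqrt(N) hidden in the data-quantity term cancels; for a nonprobability sample, ρ_{R,Y} is a fixed nonzero number and the error inflates with the population size.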
When combining data for population inference, the relatively tiny but higher-quality ones, i.e., probability samples, should be given far more weight than their sizes suggest. Rao (2020) reviewed the individual terms of the difference between the sample mean and the finite population mean.
A big sample from the finite population gives a very small variance. For example, take independently distributed random variables y_1, …, y_n with mean μ and variance σ². Then ȳ has mean μ and variance σ²/n, with its distribution otherwise unspecified. For large n, σ²/n ≈ 0; this is a common large-sample problem for statistical inference. Choi and Nandram (2021) showed how to use a random grouping method to get around this problem.
Any sample, probability or nonprobability, can be assessed through ρ_{R,Y}; here, ρ_{R,Y} ≠ 0 means the sample is biased. We investigate this for S_1. Assume a finite population of N individuals with finite population mean Ȳ_N of the response variables y = (y_1, …, y_N). Let R = (R_1, …, R_N) denote the sample inclusion indicators, so that ȳ_n = Σ_{i=1}^N R_i y_i / Σ_{i=1}^N R_i. Arrange the units so that R_1, …, R_n are ones, corresponding to the sampled individuals y_1, …, y_n, and R_{n+1}, …, R_N are zeros, corresponding to the non-sampled individuals y_{n+1}, …, y_N.
We can now use our probability sample S_2: y_1, …, y_{n_2}, and its known OSWs W = (W_1, …, W_{n_2}), to estimate the unknown population mean Ȳ_N, the variance σ², and the population size N, with N̂ = Σ_{i=1}^{n_2} W_i. Note that the estimated defect ρ̂ depends partly on the estimated sampling fraction: the larger N̂ is in relation to n, the bigger the defect becomes. Meng (2018) called this the large data paradox, but it is not true in general. For example, common sense tells us that more information (bigger N) in developing self-driving cars makes safer cars. Medical imaging, a pillar of diagnostic health care, involves high-volume data collection (large N) and processing; these data are not biased, although they can be unstructured. For our S_1 data, ρ̂ = 0.006, which means that it has a large selection bias or defect (Meng, 2018), especially if the sample size is large; in our case, it is only about 1,500. Survey organizations are trying to move away from probability sampling to reduce high cost (Sakshaug et al. 2019; Wisniowski et al. 2020). Instead, they use nonprobability samples (for example, web samples), which are less costly and easily available but possibly bring biases into the sample. To reduce the bias of the nonprobability sample, they take a small probability sample from the same population and propose a Bayesian model to combine the samples in a way that exploits one sample's strengths to overcome the other sample's weaknesses. Chen et al. (2020) developed a model for a nonprobability sample with a small probability sample under the assumption of ignorable selection, i.e., the selection probability satisfies p(R_i = 1 | x_i, y_i) = p(R_i = 1 | x_i). This assumption (Rubin 1976) is also used in nonresponse studies (Nandram and Choi 2002a, 2002b, 2006, 2010). Potthoff et al. (1992) obtained adjusted survey weights (ASWs) corresponding to the original weights.
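Given an unbiased mean and a standard deviation estimated from the probability sample, the defect correlation can be backed out of Meng's identity. A minimal sketch (the function name and the illustrative σ̂ value are ours, not from the paper):

```python
def defect_correlation(ybar_sample, n, N, Ybar_hat, sigma_hat):
    """Solve Meng's identity for rho:
    rho = (ybar - Ybar) / (sigma * sqrt((1 - f) / f)), with f = n / N."""
    f = n / N
    return (ybar_sample - Ybar_hat) / (sigma_hat * ((1 - f) / f) ** 0.5)
```

With n_1 = 1,563, N = 2,360,624, a nonprobability mean of about 27.18 against an anchor of about 25.99, and an assumed σ̂ of roughly 5 BMI units, this gives a ρ̂ near the 0.006 reported above.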
The original weights are W_i = 1/π_i, i = 1, …, n, where π_i is the known selection probability of the ith unit from the population, and N = Σ_{i=1}^n W_i is the finite population size. From these original weights they calculate adjusted weights that yield a more reasonable variance, as the original sample weights give too small a variance. Nandram (2007) used surrogate sampling to sample the entire population after an adjusted sample model is fit to the data.
Our main contribution in this paper is to integrate a nonprobability sample and a probability sample to make inference about a finite population mean using a very simple and easy-to-understand method. Specifically, we have made six important contributions.
(a) Obtain the survey weights for the nonprobability sample using record linkage instead of the less robust logistic regression; see Chen et al. (2020).
(c) Use the bootstrap to obtain the distributions, including expectations and variances, of the least squares and double robust estimators. We note that this step is not needed in the Bayesian method.
(d) For the simple situation we discussed, we showed that there is little difference among the different estimators.
However, the Bayesian method will become more useful in more complicated applications, and analytical approximations (e.g., Taylor's series expansion) can be avoided.
(e) We also estimate the correlation between the participation variable and the study variable for the nonprobability sample. This is done by supplementing the nonprobability sample with the probability sample. In Meng (2018), this correlation is non-identifiable and cannot be estimated, because he did not have a probability sample; this leads to a poor representation there.
(f) Like Chen et al. (2020), we show how to use inverse probability weighting when the population size and the non-sample covariates are unknown. This is done even for the Bayesian method, much beyond Chen et al. (2020).

This paper is divided into five sections, including the introduction. In Section 2, we first derive the ASWs and estimate the population mean with S_2; the same process is then applied separately to S_1 and S_3. In Section 3, we introduce two methods, non-Bayesian and Bayesian. In the non-Bayesian approach, we primarily use the LS method; to obtain the distribution of these estimators, we use the bootstrap method, which provides a reasonable variance estimate.
Section 4 illustrates our methods by analyzing the BMI data of NCHS. Finally, Section 5 gives a brief conclusion.
There may be sample units belonging to both S_1 and S_2, as they come from the same population U. The size of the nonprobability sample S_1 is much bigger than that of the probability sample S_2, n_2 ≪ n_1. Nevertheless, we assume that the two samples are mutually exclusive and independent.
The purpose of this paper is to estimate the finite population mean Ȳ_N = Σ_{i=1}^N y_i / N from these three samples as accurately as possible, reducing bias and variance where possible. In this paper, we created the two samples S_1 and S_2 from the NCHS data.
Some researchers use a logit function under the ignorability assumption (Rubin 1976; Little and Rubin 2002; Chen et al. 2020). Weights are used to reduce the bias introduced by sampling designs (Potthoff et al. 1992), and prediction is done using surrogate sampling (e.g., Nandram 2007) via inverse probability weighting (e.g., Chen et al. 2020). We assume that the original weights are known and fixed; we also assume that the imputed weights are fixed (not random) and known.

ASWs With Probability Sample
The study variable y_2 is observed and the sample weights are known for sample S_2. Note that, although both samples have the common covariates age, race, and sex, the values of these covariates differ across units. Hence, when the samples are matched, the matching should be done by content matching.
We drop the subscript 2 of y_2 for general application. The model is y_i = x_i′β + ε_i, ε_i ~ N(0, σ²), for any probability sample of size n.
Each y_i is weighted by its corresponding known OSW to obtain the weighted sample mean, ȳ_W = Σ_{i=1}^n W_i y_i / Σ_{i=1}^n W_i, the estimator of the finite population mean. The effective sample size (ESS) is
n̂ = (Σ_{i=1}^n W_i)² / Σ_{i=1}^n W_i², with 1 ≤ n̂ ≤ n. (2.2)
Finally, we define the adjusted sample weights (ASWs) as
w_i = n̂ W_i / Σ_{j=1}^n W_j, i = 1, …, n. (2.3)

Note that, from here on, upper-case W denotes OSWs and lower-case w denotes ASWs.
For the BMI data in Section 4, the effective sample size is n̂ = 123 for S_2, whose actual sample size is n_2 = 304, out of the population N = 2,360,624.
Theorem 1 states that there is actually an increase in variance in moving from the OSWs to the ASWs: the variance of y_i under the ASW is σ²/w_i, while under the OSW it is σ²/W_i, which is much smaller. The variance is much too small under the OSWs because the W_i are large, while the ASWs, with smaller w_i, provide a more reasonable variance.
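The ESS and ASW formulas in (2.2) and (2.3) are simple to compute; a minimal sketch (function name ours):

```python
def ess_asw(W):
    """Effective sample size (2.2) and adjusted sample weights (2.3)
    from the original sample weights W."""
    n_hat = sum(W) ** 2 / sum(Wi * Wi for Wi in W)   # ESS: (sum W)^2 / sum W^2
    total = sum(W)
    w = [n_hat * Wi / total for Wi in W]             # ASWs sum to n_hat
    return n_hat, w
```

Two checks follow directly from the formulas: the ASWs sum to n̂, and equal OSWs give n̂ = n (so 1 ≤ n̂ ≤ n, with the upper bound attained for a self-weighting sample).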
From the sample model, y_i ~ N(x_i′β, σ²/w_i), we can estimate the parameters β and σ² by the LS method (Appendix B).
First, we can draw the y_i, i = 1, …, n, from this model; this is called surrogate sampling (Nandram 2007).
However, we do not know the non-sampled covariates, especially for big data, and this creates an important practical issue. Therefore, we instead work with the model for the finite population mean itself, centered at x̄′β; we can then generate samples from this distribution once β and σ² are known.
Here the average covariate x̄ is estimated by inverse probability weighting, with N̂ = Σ_{i=1}^n W_i computed from the OSWs, and y is generated from the model for the population mean.
Finally, we present an issue of unbiasedness. We have two samples, S_1 and S_2, from the finite population with true unknown mean Ȳ_N. Note that S_1 is a convenience sample, whose weights are unknown. The two sample-based estimators of the unknown population mean Ȳ_N are ȳ_1 from S_1 and ȳ_2 = Σ W_i y_i / Σ W_i from S_2, where the W_i are the OSWs of the probability sample. Under the probability sampling design D, ȳ_2 is unbiased because it is a Horvitz-Thompson estimator, but ȳ_1 is biased precisely because S_1 is not a probability sample (it does not have sampling weights). Since E_D(Ȳ_N − ȳ_2) = 0, where the expectation is taken over the design D, we have Ȳ_N − ȳ_2 ≈ 0. The bias of ȳ_1 is then ȳ_1 − Ȳ_N ≈ ȳ_1 − ȳ_2, the difference between the two sample estimators. The ESS and ASWs of S_1 and S_3 can be obtained in the same manner as in (2.2) and (2.3), respectively.
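The weighted mean from S_2 and the resulting bias estimate for S_1 can be sketched as follows (function names ours):

```python
def ht_mean(y, W):
    """Weighted (Horvitz-Thompson ratio) mean: sum W_i y_i / sum W_i."""
    return sum(Wi * yi for Wi, yi in zip(W, y)) / sum(W)

def bias_hat(ybar_nonprob, y_prob, W_prob):
    """Bias of the nonprobability mean, using the probability sample's
    weighted mean as the design-unbiased anchor."""
    return ybar_nonprob - ht_mean(y_prob, W_prob)
```

For example, a nonprobability mean of 27.175 against a weighted S_2 mean of 25.985 gives an estimated bias of 1.190, as in Section 4.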

ASW With Nonprobability Sample
For the probability sample S_2, the selection probabilities (hence the weights) are known. When the covariate contents (i.e., age, sex, race) of units in the probability sample S_2 match those of units in S_1, the missing OSWs for S_1 are filled in with the known OSWs of S_2 by covariate matching via Mahalanobis distances (Appendix A).
Let R_i be the participation indicator: R_i = 1 if unit i ∈ S_1 and 0 otherwise. Assume p(R_i = 1 | x_i, y_i) = p(R_i = 1 | x_i), i.e., missing at random (Rubin 1976; Little and Rubin 2002; Chen et al. 2020), regardless of the design D of the sampling of S_1. Under this assumption, the propensity scores π_i can be obtained from a logistic model, log{π_i/(1 − π_i)} = x_i′γ, i = 1, …, n_1, and the weights of the nonprobability sample S_1 are then W_i = 1/π_i. However, this propensity-score approach places too many unreasonable assumptions on the logistic function and hardly reflects the true situation. Thus, in this paper, we do not use the logistic function to obtain the unknown OSWs of S_1; the main reason is that matching via Mahalanobis distances does not require such assumptions. We assume the imputed OSWs of S_1 are also fixed, not random, as with the probability sample S_2. In the actual sample in Section 4, n_1 = 1,563 for S_1; its ESS n̂_1 is computed from the imputed OSWs, and the ASWs of S_1 are used only in estimation, not in prediction.

Estimation of the Finite Population Mean Using Different Methods
We use two different approaches for the estimation of the finite population mean, Bayesian and non-Bayesian. The Bayesian approach yields a completely closed-form 95% highest posterior density interval (HPDI) for the finite population mean; we also present a sampling-based method, which is slightly more convenient. The non-Bayesian approach is basically LS; it also includes the doubly robust (DR) method (e.g., Chen et al. 2020). Because the variance from LS is too small for large N, we use the bootstrap to calculate a reasonable variance and to obtain distributions. Each method uses the data sets S_1, S_2, and S_3 separately for estimation; the DR method combines S_1 and S_2 in a different way and does not use S_3. The Bayesian and non-Bayesian methods are presented in Sections 3.1 and 3.2, respectively.

Bayesian Approach
To fit a general model, we drop the sample subscripts. Suppose a random sample y_1, …, y_n is taken from the population, with y_i | β, σ² ~ f(y_i | x_i′β, σ²/w_i), where β is a p × 1 vector and the w_i are the adjusted weights. We assume f is the normal density, and a priori p(β, σ²) ∝ 1/σ². This is Jeffreys' improper prior, and it is well known that the joint posterior density is then proper and available in closed form. The finite population mean is Ȳ ≈ x̄′β. Let X be the n × p matrix of covariates, Ω = diag(w_1, …, w_n) the n × n diagonal weight matrix, and y the n × 1 vector of response variables. Then, letting β̂ = (X′ΩX)⁻¹X′Ωy, the two-sided 95% highest posterior density interval for Ȳ is x̄′β̂ ± s.e.(Ȳ | y) t_{(n−p),0.025}, where t_{(n−p),0.025} is the 97.5th percentage point of the t-distribution with n − p degrees of freedom. This is nice when the non-sampled covariates are all known; otherwise it is convenient to use a sampling-based method to make posterior inference about Ȳ. We can use inverse probability weighting, x̄ = Σ_{i=1}^n W_i x_i / N̂, where N̂ = Σ_{i=1}^n W_i. Note that w_1, w_2, and w_3 are the ASWs for the samples S_1, S_2, and S_3, respectively. To fit the model to the individual data sets S_1, S_2, or S_3, we follow the above steps, adding the sample subscript 1, 2, or 3. The model is the same for each data set, so we present only S_2.
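Under the normal model with Jeffreys' prior, posterior draws of Ȳ = x̄′β are available in closed form: draw σ² from its scaled inverse chi-squared posterior, then β given σ² from a multivariate normal centered at β̂. A sketch of this composition sampler, assuming the weighted LS quantities defined above (function name ours):

```python
import numpy as np

def posterior_mean_draws(y, X, w, xbar, ndraw=1000, seed=None):
    """Draw from the posterior of Ybar = xbar' beta under
    y_i ~ N(x_i' beta, sigma^2 / w_i) with prior p(beta, sigma^2) ∝ 1/sigma^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    WX = X * w[:, None]                        # Omega X
    XtWX_inv = np.linalg.inv(X.T @ WX)
    beta_hat = XtWX_inv @ (WX.T @ y)           # weighted LS estimate
    resid = y - X @ beta_hat
    s2 = (w * resid) @ resid / (n - p)         # weighted residual variance
    L = np.linalg.cholesky(XtWX_inv)
    draws = np.empty(ndraw)
    for b in range(ndraw):
        # sigma^2 | y ~ (n - p) s^2 / chi^2_{n-p}; beta | sigma^2, y ~ Normal
        sigma2 = (n - p) * s2 / rng.chisquare(n - p)
        beta = beta_hat + np.sqrt(sigma2) * (L @ rng.standard_normal(p))
        draws[b] = xbar @ beta
    return draws
```

Summaries of the draws (mean, standard deviation, 95% interval) then serve as the posterior summaries reported in Section 4; the draws are t-distributed around x̄′β̂, consistent with the closed-form HPDI above.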
For the simple model we consider here, there is virtually no practical difference between the non-Bayesian method (via the bootstrap) and the Bayesian method; the interpretations are different. However, for a more complicated model, a Bayesian method will be more attractive as it avoids analytical approximations.

Non-Bayesian Approach
Two methods will be introduced to estimate the finite population mean: the LS estimator and the DR estimator. Because LS produces a variance that is too small for inference when N is large, we use the bootstrap to obtain a more realistic variance; the distributions of the estimators are also obtained with the bootstrap.

The DR estimator of the population mean (Chen et al. 2020) is built from two estimates of the finite population total, one from each of the samples S_1 and S_2. The DR estimator of the finite population mean is
ȳ_DR = (1/N̂_1) Σ_{i∈S_1} W_{1i}(y_i − x_i′β̂) + (1/N̂_2) Σ_{i∈S_2} W_{2i} x_i′β̂,
where β̂ is the LS estimator obtained from the probability sample S_2. From the numerical example on BMI, the estimated finite population mean is ȳ_DR2 = 27.075. Chen et al. (2020) claim that the first such estimator, ȳ_DR1, is doubly robust, though it is not clear why. Since they use the logistic function for the propensity scores and LS estimation for β, they make several assumptions which may not be robust. Therefore, we used the matching method to estimate the propensity scores. Although we do not have a participation model, it is still useful to make comparisons with the DR estimators.
Because we have the y values in the S_2 data, we can replace x_i′β̂ with y_i, which gives a third DR estimator, ȳ_DR3. Note that ȳ_DR3 requires LS estimation of β from S_1, not S_2.
The inverse probability weighting estimators are sensitive to misspecified models for the propensity scores, especially when the scores are very small; see Chen et al. (2020). If either the participation model or the study variable model is correctly specified, the estimator remains consistent; this is the double robustness property widely used in the recent literature on missing data. We do not have a model for the participation variable, because we use record linkage to fill in the propensity scores for the nonprobability sample with the probability sample as donor. So the double robust estimator is not really needed in our case, but we use it for comparison.
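A sketch of the generic DR estimator described above, with β̂ supplied by the caller (function name and array layout are ours):

```python
import numpy as np

def dr_mean(y1, X1, w1, X2, W2, beta_hat):
    """Doubly robust estimator of the finite population mean:
    a weighted residual correction over S1 plus a weighted
    model prediction over S2."""
    N1_hat = w1.sum()                               # S1 weight total
    term1 = (w1 * (y1 - X1 @ beta_hat)).sum() / N1_hat
    N2_hat = W2.sum()                               # S2 weight total
    term2 = (W2 * (X2 @ beta_hat)).sum() / N2_hat
    return term1 + term2
```

If the outcome model is exactly right (y_i = x_i′β̂ in S_1), the correction term vanishes and the estimator reduces to the weighted model prediction from S_2; if instead the S_1 weights are right, the correction repairs a misspecified model. That is the sense in which the estimator is "doubly" robust.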

Bootstrap Variance for the LS and DR Estimates
We have seen that the LS procedure gives very small, overly optimistic, variance estimates, mostly due to the large population size N. So we have used the bootstrap method: we draw B = 1,000 bootstrap samples from each of S_1, S_2, and S_3, obtaining replicate data sets and weights, b = 1, …, B, k = 1, 2, 3.
Here we drop the sample subscript k, as the formulas apply in general. Let β̂_b denote the LS estimator from the bth bootstrap sample, b = 1, …, B. We then compute the finite population mean as ȳ_b = x̄′β̂_b, b = 1, …, B, and obtain the bootstrap variance from these B values. The bootstrap variances are shown in the tables for the LS and DR methods in Section 4. Note that the LS formula gives too small a variance, as shown in Table 3, while the bootstrap estimators in Table 4 give a better representation of the variance of ȳ.
We bootstrap the data, both S_1 and S_2, 1,000 times in the conventional way. That is, we sample the S_1 data and the S_2 data, respectively, with replacement to constitute a single sample. For each bootstrap sample, we compute the least squares estimator and the double robust estimator. This gives us a sample of 1,000 values of each estimator, from which a kernel density estimate is obtained. The mean and variance of the estimator are obtained as summaries of these 1,000 values, and the 95% confidence intervals are obtained by the percentile method; further refinement can be done, of course.
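The resampling scheme just described, resample S_1 and S_2 separately with replacement, combine, and recompute a weighted mean, can be sketched in pure Python (function name ours; shown for the weighted mean, with the LS and DR estimators substituted in the same loop in practice):

```python
import random

def bootstrap_means(y1, w1, y2, w2, B=1000, seed=0):
    """Bootstrap the weighted mean of the combined sample:
    resample S1 and S2 independently with replacement B times."""
    rng = random.Random(seed)
    n1, n2 = len(y1), len(y2)
    means = []
    for _ in range(B):
        idx1 = [rng.randrange(n1) for _ in range(n1)]
        idx2 = [rng.randrange(n2) for _ in range(n2)]
        ys = [y1[i] for i in idx1] + [y2[j] for j in idx2]
        ws = [w1[i] for i in idx1] + [w2[j] for j in idx2]
        means.append(sum(w * y for w, y in zip(ws, ys)) / sum(ws))
    mu = sum(means) / B
    var = sum((m - mu) ** 2 for m in means) / (B - 1)
    return mu, var
```

The B replicate means also supply the percentile confidence interval and the kernel density estimates used in Figure 1.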

Data Analysis
We now present detailed results from the BMI data using our estimators. We assume that, when the weights of S_2 are used, S_2 provides an unbiased estimate of the finite population mean. Therefore, we use the estimate from S_2 as a 'gold' standard for inference. However, because S_2 is relatively small, we expect it to provide an estimate with a relatively large standard error. There are two different approaches, Bayesian and non-Bayesian. In the Bayesian approach, we provide posterior summaries and distributions (e.g., posterior mean, posterior standard deviation, posterior coefficient of variation, and 95% highest posterior density interval, HPDI). In the non-Bayesian approach, we use the LS and DR estimates; the bootstrap is used to get the distributions of the corresponding estimators and their variances. Analogous to the Bayesian approach, we report the LS estimate, its standard error, the coefficient of variation, and the 95% confidence interval for the finite population mean.
The relative standard error of an estimate, well known as the coefficient of variation (CV), is obtained by dividing the standard error of the estimate by the estimate itself, and it is usually expressed as a percentage, CV = (s.e.(ȳ)/ȳ) × 100% (Choi, 1977, Appendix). NCHS puts an asterisk on a number when its CV is greater than 30% to let data users know the number is not reliable. Note that the CV is often not accurate, and one needs to be careful in using it for statistical inference, because the standard error becomes very small as n becomes large (Choi and Nandram, 2021). Large samples cause problems not only here but also in other statistical hypothesis tests (e.g., the normal test and the t-test); in general, the sample variance is also a function of n.

We use the body mass index (BMI) data from the National Health and Nutrition Examination Survey III (NHANES III), conducted from 1988 to 1994 in two phases by the National Center for Health Statistics. There were 30,818 people examined from the US population of about 300 million, and the overall response rate was 78%. BMI is a person's weight in kilograms divided by the square of height in meters; it is an inexpensive and easy screening method for health status. Table 2 below shows the weight status of a person. We use only a small portion of the sampled people, from California counties: 6 counties for S_1 and 2 counties for S_2. The sample size of S_1 is about 1,500 and that of S_2 is about 300, so S_1 is five times the size of S_2 (i.e., S_2 is relatively small, although it is expected to give an unbiased estimate). For our S_1 data, the defect correlation is ρ̂ = 0.006, which means that it has a large selection bias or defect (Meng, 2018).
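The CV computation and the NCHS 30% flagging rule are straightforward; a minimal sketch (function names ours):

```python
def cv_percent(estimate, se):
    """Coefficient of variation (relative standard error), as a percentage."""
    return 100.0 * se / estimate

def nchs_flag(cv):
    """NCHS marks an estimate with an asterisk when its CV exceeds 30%."""
    return cv > 30.0
```

For example, a mean BMI of 27.0 with standard error 2.7 has CV = 10% and would not be flagged, while the race coefficients reported below, with CVs far above 30%, would be.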
We have the non-Bayesian methods, the LS and DR estimators, and the Bayesian method. Each of these uses the three data sets S_1, S_2, and S_3 separately for estimation. The bootstrap is used to get distributions for the LS and DR estimators. As stated earlier, we obtain S_1, S_2, and S_3 from the NCHS BMI data. For the bootstrap and the Bayesian method, we generate 1,000 values of the population mean ȳ. The distributions of these finite population means for the BMI data from S_1, S_2, and S_3 are shown in Tables 3, 4, and 5 and Figure 1.

LS Estimation
The LS estimate of the finite population mean of BMI is 27.175 for the nonprobability sample S_1, 25.985 for the probability sample S_2, and 26.982 for the combined data S_3. The biases of the means of S_1 and S_3 are thus 1.190 (= 27.175 − 25.985) and 0.997 (= 26.982 − 25.985), respectively. The sample size of S_2 is the smallest, n_2 = 304; that of S_1 is n_1 = 1,563, and the combined sample size is n_3 = 1,867. The SDs remain very small for all three because the sample sizes are rather big. We therefore used the bootstrap, which is distribution free, to find more realistic variances in Table 4. Note that some of the CVs of the β̂'s are greater than 30%, namely for sex and race, although the contributions of sex and race to BMI, 0.710 and 0.558, are bigger than the age contribution, 0.036, for S_1. This pattern persists in Tables 4 and 5. The CV of race in the table for S_2 is very large, 1154%, mainly because of the small coefficient (0.052) and large SD (0.609). Finally, the SD of the finite population mean is extremely small, owing to the large sample size that appears in the least squares variance formula.

Table 4 gives the bootstrap estimates (B = 1,000) of the finite population means and variances of ȳ for S_1, S_2, and S_3. In the last row of Table 4, we add the bootstrap variance of the LS estimates, as the LS variances were too small. The SDs for S_1 and S_3 are 0.141 and 0.127, respectively, while the SD for S_2 is 0.298, almost twice as big; this reduction is largely due to the larger sample sizes. As a DR estimate of the finite population mean and SD for our data, we obtained a BMI mean of ȳ_DR2 = 26.161 (SD = 0.524), which is smaller than ȳ_DR3 = 27.093 (SD = 0.267) in Table 4, with its much smaller variance. Note that the DR method combines S_1 and S_2 differently from the LS or Bayesian methods, and that ȳ_DR2 and ȳ_DR3 are defined in the DR method in Section 3.2. We assume the probability sample provides unbiased estimates of the mean.
The mean is 25.995 from the probability sample S_2, compared to 27.264 from the nonprobability sample S_1 and 27.093 from the combined sample S_3. Note that the bias of the nonprobability sample is 1.269 (= 27.264 − 25.995) and the bias of the combined sample is 0.981 (= 26.976 − 25.995), about three standard deviations.

Bootstrap and DR Estimation
On the other hand, the SD for S_2 is 0.390, while those for S_1 and S_3 are much smaller at 0.284 and 0.249, respectively; the SD reflects the sample size. Some of the CVs of the β̂'s are greater than 30%, especially for race and sex, although race and sex contribute more to BMI than age does for all three samples. But the CV is not a good measure, as described above; the SD is better for statistical inference. The SD for S_2 is much bigger than those for S_1 or S_3, while the bias of the mean of S_1 is 0.900 (= 26.974 − 26.074) (3.45%) and that of S_3 is 0.813 (= 26.887 − 26.074) (3.12%) when compared with the mean BMI, 26.074, of S_2. The CVs of the β̂'s for sex and race are 40.5*% and 33.5% for S_1, 50.61*% and 95.74*% for S_2, and 25% and 71.99*% for S_3; these are unreliable estimates except the one at 25%. As seen in these tables, the pattern is consistent regardless of the method used for estimating the finite population mean: larger variance with S_2 than with S_1 or S_3, and larger bias with S_1 or S_3 than with S_2.

Figure 1 below includes seven density curves: A, B, C, D, E, F, and G. They describe the distributions of the finite population mean by data set (S_1, S_2, S_3) and by method: Bayesian (A, B, C), bootstrap (D, E, F), and DR (G), which uses both samples (not S_3) together with the bootstrap. The vertical axis gives the height of the distribution of the population mean and the horizontal axis is the BMI scale. The seven curves depend on both the method and the sample size. A and D use S_1 (A Bayesian, D bootstrap); B and E use S_2 (B Bayesian, E bootstrap); and C and F use the combined sample S_3 = S_1 ∪ S_2 (C Bayesian, F bootstrap). Curve G gives the bootstrap distribution of the DR estimator.

Bayesian Estimation
The curves show the biases of the means. Curves B and E, from the probability sample S_2, are assumed to have an unbiased mean, which is used to gauge the bias of the mean estimates from the other two samples, S_1 and S_3. Curves B and E are shifted to the left of the other curves. This implies that curves A and D of the nonprobability sample, C and F of the combined sample, and the double robust curve G are biased to the right of the unbiased curves B and E. On the other hand, the bases of B and E are wider than those of the curves for S_1 and S_3, implying that their variances are bigger, owing to the smaller sample size. The two highest among the seven curves are C (for S_3) and A (for S_1); the shortest are B and E (for S_2), because of its smaller sample size.

Conclusion
Original sample weights are available in a probability sample, but they are missing in a nonprobability sample. We used the known weights to impute the unknown weights of the nonprobability sample. Others use a logistic model to obtain propensity scores and hence the unknown weights, but we use covariate matching via Mahalanobis distances to avoid unnecessary parametric assumptions.
We show a way to reduce possible bias arising from not using a sample design, via ASWs obtained from the OSWs (known or imputed). The combined data help to reduce the variance compared with that of S_2 or S_1 alone. This is another benefit of our methods and is potentially useful for data integration.
The finite population mean is estimated using both the non-Bayesian and Bayesian approaches. In the non-Bayesian approach, we use the least squares and DR estimators. The bootstrap is used to calculate the variance for the LS and DR estimators, especially because the LS variance is too small for large N; we also obtain the distribution of the finite population mean using the bootstrap. With each of the two approaches, non-Bayesian and Bayesian, the three data sets S_1, S_2, and S_3 are used separately to find the finite population mean. The DR estimator combines S_1 and S_2 in a different way, not through the combined sample S_3 and the bootstrap, with the result shown in the last row of Table 2.
Each data set gives an estimate of the finite population mean. The estimates are close, within the normal range between 26 and 27, regardless of the data sets and methods used. The tables and Figure 1 show the significant bias of the means from S_1 and S_3 compared with that of S_2, while the variances of the finite population mean from S_1 and S_3 are smaller than that of S_2. These results show a consistent pattern of reduced bias and reduced variance regardless of the method used.
The estimated regression coefficients of sex and race have unreliable CVs, unlike age, although they contribute more to the BMI estimation. Since the CV is not a good measure of reliability, one needs to be careful in using it for statistical inference.
We can extend our method to include a second probability sample that does not have the study variable. We can then use record linkage to fill in the missing weights in the nonprobability sample and the missing study variable in the probability sample. There can therefore be four data sets in the data integration, and our methods apply in much the same way. Chen et al. (2020) used two data sets, a nonprobability sample without survey weights and a probability sample without the study variable; to obtain survey weights for the nonprobability sample, they used a logit function. It is also possible to have a nonprobability sample without both the survey weights and the study variable. The study variable from the probability sample, if available, can be used to impute the missing variables using record linkage.
It is more sensible to use the probability sample, supplemented by the nonprobability sample, to obtain the finite population mean. However, because the nonprobability sample is expected to be biased and is relatively large, it will shrink the probability sample estimate away from its expected unbiased position. Therefore, a method is much needed to partially penalize the nonprobability sample. This is one of the areas in nonprobability sampling and data integration with a probability sample that is of enormous current practical interest, and we have been working on this topic.
Truth is simple, but hard to find, and this is also true in statistics. This paper tries to find the true finite population mean using a nonprobability sample supplemented by a small probability sample, and we hope that our efforts are a small step toward finding the truth. Data fusion (Castanedo, 2013) is widely used in practice to fill in missing data, and there are many ways to perform statistical data fusion; see Kedem, Oliveira and Sverchkov (2017). Data fusion, combination, or integration is very important in today's world. For example, in building self-driving cars, the more information there is (large N), the safer the cars can be; one can use all sources of information for safer self-driving cars.
The Mahalanobis distance (Stuart 2010) can be used to impute the unknown weights of a nonprobability sample using the known weights of a probability sample.
We show how to impute the unknown survey weights in the nonprobability sample using covariate matching or record linkage via the Mahalanobis distance. We consider the two samples, S_1 with covariates x_{1i}, i = 1, ..., n_1 only, and S_2 with covariates x_{2i}, i = 1, ..., n_2 and survey weights W_{2i}, i = 1, ..., n_2.
1. For each unit i in S_1, compute the Mahalanobis distances d(x_{1i}, x_{2j}), j = 1, ..., n_2, and find the smallest value. Because of discreteness, it is possible that more than one unit attains the smallest distance.
2. Randomly sample one of these units in S_2; suppose this is unit k. Then the weight assigned to unit i is W_{2k}, which we denote by W*_{1i} = W_{2k}. Repeat this step for all units in S_1.
3. These imputed weights are surrogates for the original weights.
4. Finally, compute the ASWs.
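The matching steps above can be sketched as follows. For simplicity the sketch assumes a single covariate, so that the Mahalanobis distance reduces to a standardized absolute difference; the covariate values and weights are hypothetical.

```python
import random
import statistics

def impute_weights(x1, x2, w2, seed=0):
    """Assign each unit in S_1 the survey weight of its nearest S_2 unit
    by (univariate) Mahalanobis distance, breaking ties at random."""
    rng = random.Random(seed)
    s = statistics.stdev(x2)  # scale for the standardized distance
    imputed = []
    for xi in x1:
        d = [abs(xi - xj) / s for xj in x2]      # step 1: distances
        dmin = min(d)
        ties = [j for j, dj in enumerate(d) if dj == dmin]
        k = rng.choice(ties)                     # step 2: random tie-breaking
        imputed.append(w2[k])                    # W*_{1i} = W_{2k}
    return imputed

# Hypothetical covariates and known probability-sample weights
x1 = [23.0, 30.5, 27.1]               # S_1 covariates (weights unknown)
x2 = [22.8, 31.0, 27.0, 25.0]         # S_2 covariates
w2 = [110.0, 95.0, 120.0, 100.0]      # S_2 survey weights
w1_star = impute_weights(x1, x2, w2)  # surrogate weights for S_1
# -> [110.0, 95.0, 120.0]
```

With several covariates, the same loop applies with d(x_{1i}, x_{2j}) computed from the inverse covariance matrix of the covariates.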

Appendix C. Proof of Theorem 1
Now we drop the sample-number subscript since our discussion applies in general. Here the lower-case w_i are the adjusted sample weights and the upper-case W_i are the original weights.
Suppose the population values Y_1, ..., Y_N | θ ~ f(y|θ) are independent, where θ is the parameter and f(·) is a general density linking θ to y, and take an independent sample y_1, ..., y_n | θ ~ f(y|θ). Then the log-likelihood for the entire population is ∑_{i=1}^N log f(y_i|θ), and the Horvitz-Thompson estimator of the log-likelihood is given by ∑_{i=1}^n W_i log f(y_i|θ).
Exponentiating, we have the pseudo-likelihood ∏_{i=1}^n f(y_i|θ)^{W_i}. We make two adjustments to this pseudo-likelihood. First, to reflect the correct variability, we replace the original weights by the adjusted sample weights w_i, which gives ∏_{i=1}^n f(y_i|θ)^{w_i}. This first step can be presented more rigorously, but this is not necessary. Second, following a full Bayesian approach, we need to normalize this density to get

f_s(y_i|θ) = f(y_i|θ)^{w_i} / ∫ f(t|θ)^{w_i} dt,  i = 1, ..., n.

It is worth noting that this pseudo density is a new formulation as it includes the normalization constant, thereby providing a proper density. This is different from what is presented in the literature, and it should make a difference when normality does not hold.
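The normalization step can be checked numerically. The sketch below, with hypothetical values of μ, σ², the weight w, and an arbitrary integration grid, raises a normal density to the power w, divides by its integral (trapezoid rule), and compares the result with the Normal(μ, σ²/w) density.

```python
import math

def normal_pdf(y, mu, var):
    """Density of Normal(mu, var) at y."""
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def normalized_power_density(y, mu, var, w, lo=-50.0, hi=50.0, m=20000):
    """f(y|mu,var)^w divided by its integral, via the trapezoid rule."""
    h = (hi - lo) / m
    total = 0.0
    for k in range(m + 1):
        t = lo + k * h
        fk = normal_pdf(t, mu, var) ** w
        total += fk * h if 0 < k < m else 0.5 * fk * h
    return normal_pdf(y, mu, var) ** w / total

# Hypothetical values: mu = 0, sigma^2 = 4, adjusted weight w = 1.3
val = normalized_power_density(1.0, 0.0, 4.0, 1.3)
expected = normal_pdf(1.0, 0.0, 4.0 / 1.3)  # Normal(mu, sigma^2 / w)
```

For the normal density the normalized power density coincides with Normal(μ, σ²/w), which is the sample model used in the example that follows.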
For example, suppose we have an independent sample y_1, ..., y_n | μ, σ² from Normal(μ, σ²) taken with unequal selection probabilities. Then the joint probability density becomes

∏_{i=1}^n Normal(y_i | μ, σ²)^{w_i} / ∫ Normal(t | μ, σ²)^{w_i} dt.

That is, the sample model with the above adjusted weights is given by

y_i | μ, σ² ~ Normal(μ, σ²/w_i),  i = 1, ..., n.
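Under this normal sample model, maximizing the weighted pseudo-log-likelihood in μ gives the weight-adjusted mean ∑ w_i y_i / ∑ w_i. A minimal numerical sketch, with hypothetical study values and adjusted weights:

```python
import math

def weighted_normal_loglik(mu, sigma2, y, w):
    """Log of the sample-model density prod_i Normal(y_i | mu, sigma2 / w_i)."""
    ll = 0.0
    for yi, wi in zip(y, w):
        var = sigma2 / wi
        ll += -0.5 * math.log(2 * math.pi * var) - (yi - mu) ** 2 / (2 * var)
    return ll

y = [26.0, 27.5, 25.2]  # hypothetical study values
w = [1.2, 0.8, 1.0]     # hypothetical adjusted sample weights (sum to n)

# Closed-form maximizer of the pseudo-likelihood in mu: the weighted mean
mu_hat = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

# The weighted mean attains a higher log-likelihood than nearby values
ll_hat = weighted_normal_loglik(mu_hat, 1.0, y, w)
assert ll_hat >= weighted_normal_loglik(mu_hat + 0.1, 1.0, y, w)
```

The same sample model underlies the Bayesian fitting in the paper, where a prior on (μ, σ²) is combined with this weighted likelihood.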