Using a Case-Control Genotypic Testing in Investigating the Association with Type-2 Diabetes

In the United Kingdom, Type-2 Diabetes (T2D) is the leading cause of blindness among working-age adults. It is also known to cause kidney failure, amputations and cardiovascular disease. In this study, genetic association tests were used to compare the genetic variants carried by individuals against their disease status, with the aim of finding genes that contribute to the risk of T2D. The identification of these genes could be of great importance, especially for preventive healthcare. This study used a case-control genotypic test to find associations between Single Nucleotide Polymorphisms (SNPs) on chromosome 10 and T2D. SNPs are a type of polymorphism that occurs when a single nucleotide (A, C, G or T) in the genome is substituted for another. At the beginning of the study we had a total of 28,501 SNPs; 4,101 of these were removed in the preliminary analyses after applying the Hardy Weinberg Equilibrium test and the Minor Allele Frequency filter. These quality controls were done to remove SNPs that may lead to false associations. A total of 24,400 SNPs were left for association testing using the genotypic test on the 2 × 3 contingency table. Our testing revealed a total of 12 SNPs with a potential association with the risk of T2D.


Type-2 Diabetes
The number of people with diabetes mellitus in the world's population may be expected to double, from around 180 million to 300 million by 2025 (Zimmet et al., 2001). Diabetes mellitus, also known as diabetes, is a long-term disorder caused by an absence or deficiency of insulin, which leads to a significantly high level of glucose (sugar) in the bloodstream (World Health Organization, 1999). Insulin, produced in the pancreatic β-cells, is the key hormone responsible for regulating the glucose level in the bloodstream. As mentioned by the WHO, there are two main forms of diabetes: Type-1 and Type-2 diabetes.
Type-1 Diabetes (T1D) is a condition where the pancreas is unable to secrete any insulin because the body's immune system has destroyed the insulin-producing cells (β-cells). Type-2 Diabetes (T2D) is a condition where the β-cells produce insufficient insulin to control the level of glucose in the blood, or where the body's cells, especially muscle, fat and liver cells, do not react effectively to the insulin produced (known as insulin resistance). Both forms are known to result from a combination of genetic and environmental risk factors and are therefore termed 'complex' diseases.
In the United Kingdom, diabetes is the leading cause of blindness in people of working age (Arun et al., 2003), and the main contributor to kidney failure, amputations and cardiovascular disease, including heart attack and stroke (Diabetes UK, 2014). T2D usually develops at about the age of 40 and accounts for approximately 90% of all adults affected with diabetes, making it the most common form of diabetes. T2D is known to be more prevalent among people of South Asian, African-Caribbean or Middle Eastern descent, in whom it can start as early as the age of 25.

Genetics & Molecular Biology
In this section, some of the genetic terms used in this study are described. Most of the definitions are taken from the webpage of the National Human Genome Research Institute (NHGRI) (Note 1).
It is known that T2D has a strong hereditary component. The gene is the fundamental physical and functional unit of inheritance, passed from parents to offspring. Genes contain the information that determines specific traits. They are arranged in a linear manner on structures called chromosomes. An allele is one of two or more versions of a gene. Every individual has two alleles at each gene, with one allele inherited from each parent. If the two alleles are the same, the individual is homozygous for that gene. The individual is heterozygous if the two alleles are different. Phenotype refers to an individual's observable outcome or traits, such as eye colour or disease status. In our study, the phenotype is either having T2D or not. In contrast, genotype refers to the individual's unobserved genetic contribution to the phenotype. A locus (plural, loci) is the physical location of a gene of interest on a chromosome. The entire genetic constitution of an individual is called the genome. In humans, the genome usually consists of 23 pairs of chromosomes.
In general, the risk of developing a disease can be the result of a single major gene, called 'monogenic', or of the combination of small effects from multiple genes, called 'polygenic'. Consider the case where a disease is caused by a genetic variant at a locus with two distinct forms (alleles): d, the normal allele, and D, the disease susceptibility allele. There are three possible genotypes at a single biallelic locus: dd, dD and DD, with dd representing the normal genotype. The penetrance function is the set of probability distribution functions for the phenotype given the genotype. This is given by $\Pr(Y \mid G)$, where G is the genotype and Y is the phenotype, assumed binary: 0 indicating unaffected and 1 indicating affected.
The concept of dominance can be categorized into three genetic risk models: dominant, recessive and codominant. In the dominant model, allele D (the disease susceptibility allele) is dominant over d, so genotypes dD and DD have the same effect on the phenotype (Pr (Y|dD) = Pr (Y|DD)). In the recessive model, allele D is recessive to d, so genotype dD has the same effect on the phenotype as genotype dd (Pr (Y|dD) = Pr (Y|dd)). When the allele is neither dominant nor recessive, each genotype has a different effect on the phenotype (Pr (Y|dd) ≠ Pr (Y|dD) ≠ Pr (Y|DD)). In most cases, the heterozygote (dD) has an effect intermediate between those of the two homozygotes (dd and DD). This brings us to the additive case, where Pr (Y|dD) is midway between Pr (Y|dd) and Pr (Y|DD). In this study, we assume the genetic risk model of T2D to be codominant.
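The risk models above can be illustrated with a minimal Python sketch. The function name `classify_risk_model`, the tolerance and the penetrance values used in the example are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math


def classify_risk_model(pen_dd, pen_dD, pen_DD, tol=1e-9):
    """Label a penetrance triple Pr(Y=1|dd), Pr(Y=1|dD), Pr(Y=1|DD).

    The labels follow the definitions in the text; `tol` is an assumed
    numerical tolerance for treating two penetrances as equal.
    """
    def same(a, b):
        return math.isclose(a, b, abs_tol=tol)

    if same(pen_dd, pen_dD) and same(pen_dD, pen_DD):
        return "no effect"           # genotype does not change the risk
    if same(pen_dD, pen_DD):
        return "dominant"            # one copy of D is enough
    if same(pen_dD, pen_dd):
        return "recessive"           # two copies of D are needed
    if same(pen_dD, (pen_dd + pen_DD) / 2):
        return "additive"            # heterozygote is exactly midway
    return "general codominant"      # three distinct penetrances


# Hypothetical penetrances, for illustration only.
print(classify_risk_model(0.10, 0.30, 0.30))  # dominant
print(classify_risk_model(0.10, 0.20, 0.30))  # additive
```

The additive case is checked after the dominant and recessive cases, so a triple is labelled additive only when the three penetrances differ and the heterozygote lies midway between the homozygotes.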
A nucleotide is the basic building block of nucleic acids and consists of a phosphate, a sugar and a base. The four bases in DNA are Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). The most abundant genetic variations in the human genome are the single nucleotide polymorphisms. Single nucleotide polymorphisms or SNPs (pronounced 'snips') are a type of polymorphism that occurs when a single nucleotide (A, C, G or T) in the genome is substituted for another. This is illustrated in Figure 1. There are about 3 billion nucleotides in the human genome and SNPs occur at approximately 10 million of these. If there are only two variants occurring at a single locus, it is called biallelic. Most SNPs have no effect on health or development. However, some genetic variations (such as SNPs) have been shown to be harmful and may affect the risk of developing diseases. This is the interest of our study, and we will only be dealing with biallelic SNPs. A haplotype is a group of SNPs residing on the same chromosome that tend to be inherited together.

Objective of the Study
The completion of both the Human Genome Project in 2003 and the International HapMap Project in 2005 has given researchers the tools to detect genes that are associated with the risk of common diseases (NHGRI). In 2007, the Wellcome Trust Case-Control Consortium (WTCCC, 2007) reported their findings from the Genome Wide Association Studies (GWAS) they conducted on 7 common diseases: coronary heart disease, T1D, T2D, rheumatoid arthritis, Crohn's disease, bipolar disorder and hypertension. They were successful in finding many new disease genes associated with these diseases.
In this study, we use data from the Wellcome Trust Case Control Consortium (WTCCC, 2007) but focus only on T2D. The data consist of 3499 individuals: 1999 individuals with T2D ('cases') and 1500 unaffected individuals ('controls'). The T2D cases are identified as UK Caucasian. Two control groups were used in the original study, but we use only one, which is from the National Blood Service (NBS). It is made up of individuals living within England, Scotland and Wales. Due to our limited computing capacity and the time constraints on data analysis, we analyse only the SNPs on chromosome 10, a total of 28,501 SNPs. The original study analysed the whole genome and identified chromosome 10 as containing the region that generates the strongest association signal for T2D. This is our main reason for choosing chromosome 10. Using this large number of biallelic SNPs, the objective of our study is to identify the SNPs that are associated with the risk of developing T2D in the British Caucasian population. Most of the statistical analyses were performed using the software package PLINK 1.07 (Purcell et al., 2007). Other software used were R (R Development Core Team, 2010) and Haploview (Barrett et al., 2005).

Preliminary Analysis
Quality control filtering is a crucial first step in a genetic association study to ensure the quality of the genotype data. We need to remove low-quality SNPs that may lead to false associations. The two quality control procedures used are the Hardy Weinberg Equilibrium (HWE) test and the Minor Allele Frequency (MAF) filter.

Hardy Weinberg Equilibrium
The Hardy Weinberg law states that the genotype and allele frequencies of a large, randomly mating population remain constant from one generation to the next, provided migration, mutation and natural selection do not take place (Ziegler & Konig, 2006). Consider a biallelic locus with alleles D and d. Denote f (D) and f (d) as the allele frequencies of D and d respectively. There are three possible genotypes: DD, Dd and dd. For each locus, HWE gives two predictions. Firstly, the allele frequencies in a population will not change from generation to generation: p = f (D), q = f (d). Secondly, the genotype frequencies remain constant after one generation in the proportions $f(DD) = p^2$, $f(Dd) = 2pq$ and $f(dd) = q^2$. A significant deviation from HWE for a SNP could be due to non-random mating, selection or genotyping error. Our major concern is genotyping error, where heterozygotes are most commonly misclassified as one of the homozygotes. For example, AT is misclassified as AA or TT. Genotyping errors may lead to a false association.
We performed the HWE test on the control group only, because in a case-control test a deviation from HWE in the cases may itself reflect an association with the risk of T2D. We tested HWE for each SNP by comparing the observed genotype frequencies with those expected under HWE. The null hypothesis, H 0, states that there is no significant difference between the observed and the expected genotype frequencies under HWE. The alternative hypothesis, H 1, is that there is a significant difference between the observed and expected genotype frequencies. The commonly used approaches are Pearson's chi-squared goodness-of-fit test and Fisher's exact test (Agresti, 2013). Fisher's exact test is preferred: although it is computationally more demanding, it can be computed easily in PLINK.

HWE test: Using chi-squared test
Consider a sample of n individuals, and denote the observed genotype frequencies (counts) of DD, Dd and dd as f (DD), f (Dd) and f (dd) respectively. Denote f (D) and f (d) as the allele frequencies of alleles D and d, given by
$$f(D) = \frac{2 f(DD) + f(Dd)}{2n}, \qquad f(d) = \frac{2 f(dd) + f(Dd)}{2n}.$$
Under the null hypothesis, the expected genotype frequencies for DD, Dd and dd are given by
$$E(DD) = n f(D)^2, \qquad E(Dd) = 2n f(D) f(d), \qquad E(dd) = n f(d)^2,$$
the test statistic is the Pearson statistic
$$X^2 = \sum_{g \in \{DD,\, Dd,\, dd\}} \frac{\left(f(g) - E(g)\right)^2}{E(g)},$$
and the number of degrees of freedom, v, is calculated by
$$v = \text{number of expected genotypes} - \text{number of alleles} = 3 - 2 = 1.$$
The $X^2$ test statistic asymptotically follows a chi-squared distribution with one degree of freedom, from which we obtain the corresponding p-value and compare it with the threshold to draw a conclusion about HWE.
Consider the example of SNP_A-4300367 in the 1500 controls, so n = 1500. The genotypes AA, AG and GG have observed genotype frequencies of 328, 763 and 409 respectively. Denote f (A) and f (G) as the allele frequencies of A and G respectively, given by
$$f(A) = \frac{2(328) + 763}{2(1500)} = 0.473, \qquad f(G) = \frac{2(409) + 763}{2(1500)} = 0.527.$$
The observed and expected genotype frequencies are summarized in Table 1. The corresponding p-value for $X^2 = 0.6187$ on 1 degree of freedom is 0.4315. In this analysis, we set the p-value threshold at $5.7 \times 10^{-7}$. Since the p-value is higher than the significance level, there is not enough evidence to reject the null hypothesis. We conclude that SNP_A-4300367 is in Hardy Weinberg Equilibrium. For Fisher's exact test, it is impractical to calculate this example without the help of statistical software. Hence, in our study, we conducted Fisher's exact test in PLINK for the analysis of HWE.
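The chi-squared HWE calculation for this example can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not the PLINK procedure used in the study (which applied Fisher's exact test); the function name `hwe_chi_squared` is hypothetical. For one degree of freedom the chi-squared tail probability equals erfc(sqrt(X²/2)), so only the standard library is needed.

```python
import math


def hwe_chi_squared(n_aa, n_ab, n_bb):
    """Pearson goodness-of-fit test for HWE from three genotype counts.

    Returns (X2, p_value) on 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of the first allele
    q = 1.0 - p                       # frequency of the second allele
    if p in (0.0, 1.0):
        raise ValueError("monomorphic SNP: expected counts would be zero")
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(x2 / 2.0))
    return x2, p_value


# Control genotype counts for SNP_A-4300367 quoted in the text.
x2, p_value = hwe_chi_squared(328, 763, 409)
print(round(x2, 4), round(p_value, 4))
in_hwe = p_value > 5.7e-7  # threshold used in this study
```

Run on the counts 328, 763 and 409, the sketch gives a statistic of about 0.6187 and a p-value of about 0.4315, in agreement with the values quoted above.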

Minor Allele Frequency
The Minor Allele Frequency (MAF) is the count of the less frequent allele (the minor allele) divided by the total allele count in the sample. In this study, we assume that common SNPs, with MAF greater than 0.01, are the cause of common diseases like T2D. Common SNPs are also more likely to reflect true associations than rare SNPs, since there is greater power to detect associations with common SNPs. However, studies have shown that rare SNPs may contribute to the risk of common diseases (Bodmer & Bonilla, 2008). We do not analyse these rare SNPs in our study: we exclude SNPs with a MAF of less than 0.01, otherwise known as 'rare SNPs'. For example, for SNP_A-4252987 we have a count of 8 for allele T and 6991 for allele G. Thus allele T is the minor allele and its minor allele frequency is $\text{MAF} = 8 / (8 + 6991) \approx 0.0011$. Since the MAF is less than 0.01, we exclude this SNP from further analysis.
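As a minimal sketch, the MAF filter can be written in Python as follows. The function name `minor_allele_frequency` and the constant `MAF_THRESHOLD` are hypothetical names introduced for illustration; the allele counts are those quoted for SNP_A-4252987.

```python
MAF_THRESHOLD = 0.01  # SNPs with MAF below this value are excluded


def minor_allele_frequency(count_allele_1, count_allele_2):
    """Frequency of the less common of two alleles at a biallelic SNP."""
    total = count_allele_1 + count_allele_2
    if total == 0:
        raise ValueError("no alleles observed")
    return min(count_allele_1, count_allele_2) / total


# Allele counts for SNP_A-4252987 quoted in the text: T = 8, G = 6991.
maf = minor_allele_frequency(8, 6991)
keep_snp = maf >= MAF_THRESHOLD
print(round(maf, 4), keep_snp)
```

For these counts the MAF is about 0.0011, so `keep_snp` is false and the SNP is dropped.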

The Association Testing
Genetic association studies aim to find relationships between the genetic variants carried by individuals and their phenotypes. Such a study requires that the individuals tested are unrelated or only distantly related to each other. We used biallelic single nucleotide polymorphisms (SNPs) as the genetic variants and looked for their relationship with the risk of T2D. Specifically, we used the case-control test of association. It compares a single SNP with the disease status (case or control) and declares an association when a particular variant of the SNP is significantly more abundant in cases or in controls.
The most powerful test for detecting association is one that uses the true underlying genetic risk model. However, the true genetic risk model is not known, and choosing the wrong one leads to very low power to detect association. In this study, we used only the genotypic test, which assumes a codominant risk model. A codominant risk model can be either general or additive. We do not perform tests based on the dominant or recessive risk models.

Genotypic test, 2 × 3 table
In a case-control setting, we can represent the genotype data in a 2 × 3 contingency table where each individual is classified according to their disease status and the genotype they carry. Here, we denote D and d for the major and minor allele respectively. The three possible genotypes are DD, Dd and dd, representing common homozygotes, heterozygotes and rare homozygotes respectively. This construction is shown in Table 2. To test for association between a SNP and the disease, we carry out the usual $\chi^2$ test for independence of rows and columns in a contingency table. The null hypothesis is H 0: the SNP has no association with the disease (no association between the rows and columns of the table), against the alternative hypothesis H 1: the SNP has an association with the disease. This is called the genotypic test.
The chi-squared test statistic is given by
$$X^2 = \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$
where $O_{ij}$ is the observed count in row i and column j of the table and $E_{ij}$ is the count expected under independence (row total × column total / grand total), and the number of degrees of freedom, v, is given by
$$v = (2 - 1)(k - 1) = k - 1,$$
where k is the number of genotypes. Here, $X^2$ has a $\chi^2$ distribution with 2 degrees of freedom under the null hypothesis. For example, for SNP_A-2061203 the hypotheses are:
H 0: The SNP (SNP_A-2061203) has no association with the risk of T2D.
H 1: The SNP (SNP_A-2061203) has an association with the risk of T2D.
SNP_A-2061203 has genotype counts of 751, 926 and 322 for CC, CT and TT, and the corresponding p-value on 2 degrees of freedom is 0.8453. At a significance level of $5.7 \times 10^{-7}$, we do not have enough evidence to reject the null hypothesis and hence conclude that there is no association between the SNP (SNP_A-2061203) and the risk of T2D.
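The genotypic test on a 2 × 3 table can be sketched in Python as below. This is an illustrative implementation of the standard Pearson test of independence, not the PLINK code used in the study; the function name `genotypic_test` is hypothetical, and the genotype counts in the usage example are invented for illustration and are not data from this study. For 2 degrees of freedom the chi-squared tail probability is exp(−X²/2), so no statistics library is required.

```python
import math


def genotypic_test(case_counts, control_counts):
    """Pearson chi-squared test of independence on a 2 x 3 genotype table.

    Each argument holds the counts of the three genotypes (DD, Dd, dd).
    Returns (X2, p_value) on 2 degrees of freedom.
    """
    observed = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("empty row or column: expected counts would be zero")
    grand_total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand_total
            x2 += (obs - exp) ** 2 / exp
    # Upper tail of the chi-squared distribution with 2 degrees of freedom.
    p_value = math.exp(-x2 / 2.0)
    return x2, p_value


# Hypothetical genotype counts, for illustration only.
x2, p_value = genotypic_test((30, 50, 20), (20, 50, 30))
print(round(x2, 4), round(p_value, 4))
significant = p_value < 5.7e-7  # threshold used in this study
```

With these invented counts every expected cell is 25, 50 or 25, the statistic is 4 and the p-value is exp(−2) ≈ 0.1353, so the null hypothesis would not be rejected at the study's threshold.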

Hardy Weinberg Equilibrium (HWE) Test
Using PLINK, we ran the HWE test with Fisher's exact test on each of the 28,501 SNPs to generate the observed HWE p-values. We applied the HWE test to the control group only.

The Genotypic Test
The genotypic test on a 2 × 3 contingency table was performed on the remaining 24,400 SNPs, which had passed both the HWE and MAF filters. A histogram of the genotypic p-values shows their distribution across all SNPs. A quantile-quantile (QQ) plot is constructed by arranging the observed negative log genotypic p-values ($-\log_{10} p$) from smallest to largest and plotting them against the expected negative log genotypic p-values, where the expected values are obtained by assuming that the test statistic has a chi-squared distribution with 2 degrees of freedom. A red line is added to the QQ plot to indicate the expected values under the null hypothesis. The QQ plot can show SNPs that deviate from the null hypothesis and hence may have an association with the risk of T2D. Both of these plots are shown in Figure 7. In Figure 7 (b), we can see some SNPs showing a large deviation from the expected values. The SNPs of interest are the ones with p-values less than $5.7 \times 10^{-7}$, for which the distribution of the genotypes is considered not random and associated with the risk of T2D. To address the multiplicity issue, we applied the Bonferroni, Sidak, Holm and False Discovery Rate (FDR) corrections to the genotypic p-values (Blakesley et al., 2009; Bretz et al., 2005). The SNPs are arranged according to p-value, from smallest to largest. The distribution of the top 17 p-values under the genotypic association test is shown in Table 3. In this table, '-' means the p-value is higher than $5.7 \times 10^{-7}$.
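The coordinates of such a QQ plot can be sketched in Python as follows. The sketch assumes that, under the null hypothesis, the p-values are uniformly distributed, so that the expected i-th smallest of m p-values is taken as (i − 0.5)/m; this plotting position is an assumption of the sketch and may differ from the one used to draw Figure 7. The function name `qq_points` and the example p-values are hypothetical.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p), smallest to largest.

    Assumes all p-values are strictly positive and uniformly distributed
    under the null hypothesis.
    """
    m = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i - 0.5) / m) for i in range(1, m + 1))
    return list(zip(expected, observed))


# Hypothetical p-values, for illustration only.
points = qq_points([0.5, 0.05, 0.005])
for expected, observed in points:
    print(round(expected, 3), round(observed, 3))
```

Points lying well above the line observed = expected at the upper end correspond to SNPs whose p-values are smaller than expected under the null hypothesis.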
Initially, there were 35 SNPs with a genotypic p-value of less than $5.7 \times 10^{-7}$. After applying the Bonferroni, Sidak, Holm and FDR corrections, there were 11, 11, 11 and 17 SNPs respectively with p-values less than $5.7 \times 10^{-7}$, as shown in Table 3. For all 17 significant SNPs, we constructed the Linkage Disequilibrium (LD) plot within a range of ±20 kb (or 50 kb) of the base position of the SNP using Haploview (see Figure 9). This is to identify the presence of LD between the SNPs.
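The four corrections can be illustrated with a minimal Python sketch. It implements the textbook single-step Bonferroni and Sidak adjustments, the Holm step-down adjustment and the Benjamini-Hochberg FDR adjustment; it is assumed, not confirmed by the source, that these match the variants computed by the software used in the study. The function name `adjust_p_values` and the example p-values are hypothetical.

```python
import math


def adjust_p_values(p_values):
    """Return adjusted p-values, in the input order, for four corrections."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p

    bonferroni = [min(1.0, p * m) for p in p_values]
    # 1 - (1 - p)^m, written with log1p/expm1 to stay accurate for tiny p.
    sidak = [1.0 if p >= 1.0 else -math.expm1(m * math.log1p(-p))
             for p in p_values]

    # Holm: step down from the smallest p-value, enforcing monotonicity.
    holm = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        holm[idx] = running_max

    # Benjamini-Hochberg: step up from the largest p-value.
    fdr_bh = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        running_min = min(running_min, p_values[idx] * m / (rank + 1))
        fdr_bh[idx] = running_min

    return {"bonferroni": bonferroni, "sidak": sidak,
            "holm": holm, "fdr_bh": fdr_bh}


# Hypothetical p-values, for illustration only.
adjusted = adjust_p_values([0.01, 0.04, 0.03, 0.005])
for name, values in adjusted.items():
    print(name, [round(v, 4) for v in values])
```

The FDR adjustment is never larger than the Holm or Bonferroni adjustment for the same SNP, which is consistent with more SNPs remaining below the threshold after FDR correction than after the family-wise corrections.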

Discussion
We had a total of 28,501 SNPs at the start of our study. In the control group, 411 SNPs had an HWE p-value of less than $5.7 \times 10^{-7}$, which could be the result of genotyping error. Since they may lead to false associations, they were excluded from further analysis. After excluding the 411 SNPs, we found that 3690 SNPs had a MAF of less than 0.01. Since we assume common SNPs (with MAF > 0.01) to be the underlying cause of T2D, these 3690 rare SNPs were also removed from further analysis. Therefore, in the preliminary analyses, we removed a total of 4101 SNPs that could potentially lead to false associations. A total of 24,400 SNPs were left for association testing.
The first test of association we performed was the genotypic test on the 2 × 3 contingency table. This assumes that the risk model of T2D is general, or non-additive. Here, we found 35 SNPs that were significant at a threshold of $5.7 \times 10^{-7}$, which reduced to 17 SNPs after the FDR correction and to 11 SNPs after the more conservative Family-Wise Error Rate (FWER) corrections. The 6 SNPs (SNP_A-2005462, SNP_A-1793595, SNP_A-2130702, SNP_A-1893548, SNP_A-2298644 and SNP_A-1809965) that passed the FDR correction but failed the FWER corrections were all found to be in high LD with each other. In this case, the alleles of the SNPs were dependent on each other and therefore FDR is a better correction than FWER. It was sufficient to analyse only one of the 6 SNPs. Two SNPs (SNP_A-1780818 and SNP_A-4271922) were found to be in high LD (D' > 95) with neighbouring SNPs that we had declared to have no association with the risk of T2D. This does not necessarily mean that the two SNPs are false associations. The remaining 9 SNPs were not in high LD with any of the neighbouring SNPs. This could mean either that the SNPs are true causal variants directly influencing the risk of T2D, or that they are false associations.
We then had a total of 12 SNPs with a potential association to be analysed further. The 12 SNPs are SNP_A-1780818, SNP_A-1811809, SNP_A-2147541, SNP_A-1859018, SNP_A-4271922, SNP_A-1833194, SNP_A-2212963, SNP_A-4280587, SNP_A-2083541, SNP_A-2155029, SNP_A-4273904 and SNP_A-2005462 (one of the 6 SNPs that were in LD). The genotypic test is preferred (most powerful) when the genetic risk model is not additive.
In the original WTCCC study, there was a cluster of SNPs at base positions in the range 114.5-115.0 Mb on chromosome 10 that showed the strongest association signal for T2D. It was represented by the reference SNP rs4506565 (or SNP_A-2005462). This SNP was said to be in LD with rs7903146, the variant with the strongest aetiological claims. Additionally, it was shown that rs7903146 had caused rs4506565 to give strong association signals. In our study, when we used the genotypic test of association, we found 6 SNPs at base positions in the range 114.5-115.0 Mb. The alleles of these 6 SNPs were in high LD with each other, and one of them was rs4506565 (or SNP_A-2005462), the SNP found in the original study. This SNP had a moderate p-value of $6.618 \times 10^{-8}$ (after FDR correction). The original study (WTCCC) reported a p-value of $5.05 \times 10^{-12}$ for this SNP. The original study used two control groups, extensive quality control measures and supplementary approaches in the case-control tests of association. By comparison, our study may not be as powerful and may be considered inconclusive. Further tests and experimental investigations are required to draw conclusions about the remaining 11 SNPs that showed a potential association under the genotypic test.

Conclusion
The objective of the present study was to find SNPs that have an association with the risk of T2D in the British Caucasian population. We assumed common SNPs to be the cause of T2D and excluded the rare SNPs (with MAF < 0.01) before performing any tests of association. In doing so, we might have excluded SNPs that have a genuine association with the risk of T2D. Furthermore, we used tests of association under the assumption that the risk model is either general or additive. It is possible that the true risk model is dominant or recessive. Since the true genetic risk model for T2D is not known, we were unable to decide on the best test of association to use. Many assumptions were made in conducting this study, which may raise questions about the quality of the results. In the original study, only one SNP on chromosome 10 (SNP_A-2005462) showed a strong association signal for T2D, and it was shown to have an indirect association with the risk of T2D. In our study, when we assumed the genetic risk model to be general and applied the genotypic test of association, we were successful in identifying SNP_A-2005462 as having a potential association. Even though we took a different approach and made only a small comparison with the original study, we did manage to find this important SNP. However, we identified an additional 11 SNPs under the genotypic test as having a potential association. Further analyses using other case-control tests may be required to determine whether these are other direct associations or false associations. Although the present study is very limited, it still highlights the potential for finding SNPs that have a true association with the risk of T2D.
Figure 2. The genotype counts for SNP_A-2061203
Figure 3. (a) Histogram of the HWE p-values under the null hypothesis (no SNPs excluded), (b) QQ plot for the HWE p-values (no SNPs excluded)
Figure 7. (a) Histogram of the p-values under the genotypic test, (b) QQ plot of the observed against the expected log (p-value) under the genotypic test

Table 1. Distribution of the observed and expected genotypic counts

Table 2. The general 2 × 3 contingency table of the genotype counts

Table 3. Distribution of the raw and adjusted p-values under the genotypic association test

Table 4. Comparison of the p-values found in the present study and in the original study for SNP_A-2005462 or rs4506565 (the p-values in the present study are calculated after applying the FWER correction)