Summary of QC process

An outline of the QC process is provided below:

  1. The raw data was manually processed in GenomeStudio and exported as a report file
  2. The report file has been processed through our automated genotyping QC pipeline which:
    • Identifies duplicate samples and renames if present (duplicated IDs are returned to original ID at end of pipeline)
    • Removes samples with a call rate below 0.95
    • Removes problematic SNPs that were identified in GenomeStudio
    • Calculates optimal zCall threshold and uses it to apply the zCall algorithm
    • Converts the zCall output file to PLINK binary format
    • Updates A/B alleles to Illumina TOP strand based on the Illumina chip used for genotyping
    • Creates a further processed file based on iterative removal using a SNP call rate threshold of 95% and a sample call rate threshold of 98%
    • Identifies sex discrepancies
    • Identifies possible related samples
    • Identifies heterozygosity outliers
    • Estimates ancestry

Further information on individual files, including the final processed data, is provided in the Content Structure section.


General information and overview of statistics

  • GenomeStudio QC SOP
  • Genotyping QC pipeline SOP
  • Genotype chip used: HumanCoreExome.24v1.0_A.csv
  • Genome Build: Genome Reference Consortium Human Build 37 (GRCh37), also known as hg19
  • Number of SNPs prior to QC: 547644
  • Number of SNPs failing GenomeStudio QC: 2184
  • Number of SNPs after QC: 545460
  • Number of samples prior to QC: 288
  • Number of duplicated samples: 8
  • Number of samples removed due to call rate below 95% (includes samples failing GenomeStudio QC): 12
  • Number of samples after QC: 276
  • Number of samples with call rate between 95-98% (not removed): 1
  • *Number of gender mismatches identified: 5
  • *Number of samples to be aware of due to heterozygosity: 3
  • *Number of unique related samples identified: 8

*IMPORTANT NOTE: Only SNPs that were deemed unreliable in GenomeStudio and samples that had a call rate below 95% were removed from this data. The list of samples removed (if any) can be found under the Sample Call Rate section of this summary report and within the "6.Samples_SNPs_removed" folder. The remaining SNPs and samples were further processed using zCall to create the "final processed data", which is located in the "8.FINAL_QC_DATA" folder. Additional SNPs and samples (SNP call rate <95% and sample call rate <98%) were removed only to calculate the gender mismatch, heterozygosity, related sample, and ancestry statistics. These SNPs and samples have been reintroduced in the "final processed data", but the affected samples are listed under the Overview of potential problematic samples section of this summary report. If conducting a GWAS, the "final processed data" can be analysed using the recommended guide Coleman et al. (2016), which also contains a GWAS codebook.


Sample QC

Duplicate Samples

If duplicate samples existed in the data, they were renamed by appending "duplicate" followed by the duplicate number. For instance, if "sample-name" appeared three times, then two of the copies would be renamed "sample-name-duplicate-1" and "sample-name-duplicate-2". This allows PLINK to treat each sample independently. Any changed sample ID was reverted to its original ID after QC in the final processed .fam file; a ".fam_with_dup_ID" file, which retains the changed sample IDs, is also provided.
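The renaming scheme described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the choice that the first occurrence keeps its original ID is an assumption; the pipeline's own script may decide this differently.

```python
from collections import Counter


def rename_duplicates(sample_ids):
    """Give repeated IDs a "-duplicate-N" suffix so every ID is unique.

    Assumption: the first occurrence keeps its original ID.
    Returns the new ID list and a mapping new ID -> original ID.
    """
    seen = Counter()
    new_ids = []
    changes = {}
    for sid in sample_ids:
        n = seen[sid]
        seen[sid] += 1
        if n == 0:
            new_ids.append(sid)
        else:
            new_id = f"{sid}-duplicate-{n}"
            new_ids.append(new_id)
            changes[new_id] = sid
    return new_ids, changes


def revert_ids(ids, changes):
    """Restore the original IDs, as done at the end of the pipeline."""
    return [changes.get(i, i) for i in ids]


ids = ["sample-name", "other", "sample-name", "sample-name"]
renamed, changes = rename_duplicates(ids)
print(renamed)
# ['sample-name', 'other', 'sample-name-duplicate-1', 'sample-name-duplicate-2']
print(revert_ids(renamed, changes) == ids)  # True
```

Keeping the mapping alongside the renamed IDs is what makes both the reverted .fam file and the ".fam_with_dup_ID" file straightforward to produce.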

The following sample IDs were changed:

NOTE: This is an interactive table where samples can be explored

ZCall Threshold

The zCall algorithm is a rare variant caller specifically designed for calling rare single-nucleotide polymorphisms from array-based technology Goldstein et al. (2012). The algorithm uses the intensity profile of the common allele homozygote cluster to define the location of the other two genotype clusters. The algorithm only assigns genotype calls to SNPs which have not been called by GenomeStudio, effectively increasing the quality of the data.

Samples with a call rate below 95% are removed and the remaining samples are processed using the zCall algorithm with a z-score threshold. This z-score threshold is calculated from common SNPs (MAF > 0.05) using a linear regression model that relates to the standard deviations of the X and Y intensities. To find the best z-score threshold, common sites are recalled using various values of z to find the best overall concordance with GenomeStudio's GenCall algorithm.

The concordance of various z-score thresholds is shown below:

Global concordance of Illumina's Gencall and zCall


The optimal z-score threshold was identified as 5 and was used to process the data.
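The threshold selection step can be sketched as follows. This is an illustration only: the genotype vectors are invented toy data, "NC" is an assumed no-call code, and the function names are hypothetical. It shows the idea of scoring each candidate z by its concordance with GenCall at sites both callers have called, then taking the z with the highest score.

```python
def concordance(gencall, zcall, no_call="NC"):
    """Fraction of sites called by both callers where the genotypes agree."""
    pairs = [(g, z) for g, z in zip(gencall, zcall)
             if g != no_call and z != no_call]
    if not pairs:
        return 0.0
    return sum(g == z for g, z in pairs) / len(pairs)


def best_threshold(gencall, zcalls_by_z):
    """Return the z value with the highest concordance, plus all scores."""
    scores = {z: concordance(gencall, calls)
              for z, calls in zcalls_by_z.items()}
    return max(scores, key=scores.get), scores


# Toy data: GenCall genotypes at six common sites and zCall recalls
# of the same sites at three candidate thresholds.
gencall = ["AA", "AB", "BB", "AA", "NC", "AB"]
zcalls_by_z = {
    3: ["AA", "AA", "BB", "AB", "AA", "AB"],
    5: ["AA", "AB", "BB", "AA", "AA", "AB"],
    7: ["AA", "AB", "BB", "AA", "AA", "BB"],
}
best_z, scores = best_threshold(gencall, zcalls_by_z)
print(best_z, scores)
```

The site that GenCall left uncalled is skipped when scoring, since there is no reference genotype to compare against.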

Sample Call Rate

The sample call rate is the proportion of SNPs with an assigned genotype from all SNPs available. Samples with a low sample call rate can indicate poor DNA quality and therefore should be removed from further analysis Laurie et al. (2010). Initially samples with a call rate below 90% were removed in GenomeStudio as these can interfere with the GenCall clustering algorithm. Following GenomeStudio QC, samples with a call rate below 95% were then removed prior to further processing with zCall.
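As a minimal sketch of the definition above, the following computes a per-sample call rate and separates samples at the 95% threshold. The sample IDs and genotypes are toy data, "NC" is an assumed no-call code, and the function names are hypothetical.

```python
def sample_call_rate(genotypes, no_call="NC"):
    """Proportion of SNPs with an assigned genotype for one sample."""
    called = sum(1 for g in genotypes if g != no_call)
    return called / len(genotypes)


def split_by_call_rate(samples, threshold=0.95):
    """Separate samples into kept (rate >= threshold) and removed (below)."""
    kept, removed = {}, {}
    for sample_id, genotypes in samples.items():
        rate = sample_call_rate(genotypes)
        (kept if rate >= threshold else removed)[sample_id] = rate
    return kept, removed


# Toy data: 20 SNPs per sample.
samples = {
    "S1": ["AA"] * 20,               # call rate 1.00
    "S2": ["AA"] * 19 + ["NC"],      # call rate 0.95, not below threshold
    "S3": ["AA"] * 18 + ["NC"] * 2,  # call rate 0.90, removed
}
kept, removed = split_by_call_rate(samples)
print(sorted(kept), sorted(removed))  # ['S1', 'S2'] ['S3']
```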

The total number of samples removed from the data that have a call rate below 95% is 12. Further information on these samples is provided below:

NOTE: This is an interactive table where samples with a low call rate can be explored (if any). Samples with a call rate of 0 were removed during the GenomeStudio QC phase as these had a call rate of < 90% or were identified as problematic



The distribution of sample call rates remaining in the data after QC is shown below:



The recommended sample call rate cut-off for a typical GWAS has been suggested to be between 98% and 99% Turner et al. (2011), Coleman et al. (2016). The number of additional samples that would fail the minimum sample call rate threshold of 98% (based on iterative SNP/sample removal with the SNP call rate threshold set to 95%) is 1. These samples are detailed below:

NOTE: This is an interactive table where samples can be explored

Sex Discrepancies

The sex chromosomes were thoroughly processed in GenomeStudio prior to processing through the genotyping QC pipeline. SNPs and samples were iteratively removed using a SNP call rate threshold of 95% and a sample call rate threshold of 98%. Therefore, sex discrepancy results are only available for samples with a call rate above 98%. The data was then pruned and sex discrepancies were calculated according to thresholds specified in Coleman et al. (2016).

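The F statistic is calculated for all samples, where F is the X chromosome inbreeding coefficient, a measure of how far the observed homozygosity on the X chromosome exceeds the level expected from the allele frequencies. Males carry a single X chromosome and therefore appear homozygous, giving an F close to 1. Samples with an F statistic above 0.8 are presumed to be genetically "Male" while samples with an F statistic below 0.2 are presumed to be genetically "Female".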
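The thresholds above translate into a simple classification rule, sketched here in Python. The records are toy data and the function names are hypothetical. Treating an F value between 0.2 and 0.8 as "Ambiguous", and reporting it as a discrepancy, is an assumption of this sketch rather than a documented behaviour of the pipeline.

```python
def genetic_sex(f_stat, male_min=0.8, female_max=0.2):
    """Infer genetic sex from the X chromosome F statistic."""
    if f_stat > male_min:
        return "Male"
    if f_stat < female_max:
        return "Female"
    return "Ambiguous"  # assumption: values in between are left unassigned


def sex_discrepancies(records):
    """records: iterable of (sample_id, clinical_sex, f_stat).

    Returns (sample_id, clinical_sex, genetic_sex) for every mismatch.
    """
    mismatches = []
    for sample_id, clinical, f_stat in records:
        inferred = genetic_sex(f_stat)
        if inferred != clinical:
            mismatches.append((sample_id, clinical, inferred))
    return mismatches


records = [
    ("S1", "Male", 0.98),
    ("S2", "Female", 0.02),
    ("S3", "Female", 0.95),  # clinically female, genetically male
    ("S4", "Male", 0.50),    # neither threshold met
]
print(sex_discrepancies(records))
# [('S3', 'Female', 'Male'), ('S4', 'Male', 'Ambiguous')]
```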


The distribution of the F statistic for clinically assigned Males is shown below:

NOTE: This is an interactive plot where samples can be identified by hovering the mouse over the image


The distribution of the F statistic for clinically assigned Females is shown below:

NOTE: This is an interactive plot where samples can be identified by hovering the mouse over the image.


A total of 5 sex discrepancies have been identified and are listed below:

NOTE: This is an interactive table where samples with a gender mismatch can be explored.

Heterozygosity Outliers

Samples that deviate from the expected heterozygosity, when compared to overall heterozygosity rate of the study, can aid in the identification of problematic samples. High levels of heterozygosity can indicate samples of low quality while low levels of heterozygosity can be due to inbreeding Marees et al. (2018).

SNPs and samples were iteratively removed using a SNP call rate threshold of 95% and sample call rate of 98%. Therefore, heterozygosity outlier identification is only available for samples with a call rate above 98%. The data was then pruned and heterozygosity outliers were calculated using methods described in Coleman et al. (2016). A 3 standard deviation (SD) threshold was used to identify outliers.
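The 3 SD rule can be sketched as below. The heterozygosity rates are invented toy values and the function name is hypothetical. The use of the sample standard deviation (rather than the population form) is an assumption; the method in Coleman et al. (2016) should be consulted for the exact calculation.

```python
from statistics import mean, stdev


def het_outliers(het_rates, n_sd=3.0):
    """Return samples whose heterozygosity rate lies more than n_sd
    standard deviations from the mean of all samples."""
    values = list(het_rates.values())
    mu = mean(values)
    sd = stdev(values)  # assumption: sample standard deviation
    lower, upper = mu - n_sd * sd, mu + n_sd * sd
    return {sid: rate for sid, rate in het_rates.items()
            if rate < lower or rate > upper}


# Toy data: 19 samples close to 0.30 and one clearly elevated sample.
het_rates = {f"S{i}": 0.300 + 0.001 * (i % 5 - 2) for i in range(19)}
het_rates["OUT"] = 0.45
print(het_outliers(het_rates))  # {'OUT': 0.45}
```

Because the outlier itself inflates the standard deviation, a single extreme sample can only be flagged at 3 SD when enough samples are present, which is one reason the test is run on the whole study rather than on small batches.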

The distribution of the heterozygosity rate is shown below:


NOTE: This is an interactive plot where samples can be identified by hovering the mouse over the image


A total of 3 heterozygosity outliers were identified and are listed below:

NOTE: This is an interactive table where samples can be explored

Identity by Descent

Identity-by-descent (IBD) analysis estimates how closely a pair of individuals is genetically related. A typical GWAS assumes all subjects are unrelated; closely related samples can therefore lead to biased standard errors of SNP effect estimates if not correctly addressed Marees et al. (2018). If self-reported relationship information is available, then IBD can be used to identify potential sample mix-ups and/or cross-sample contamination.

SNPs and samples were iteratively removed using a SNP call rate threshold of 95% and a sample call rate threshold of 98%. Therefore, IBD identification is only available for samples with a call rate above 98%. The data was then pruned and IBD calculated using methods described in Coleman et al. (2016). Pairs of individuals with an IBD pi-hat (PI_HAT) metric over 0.1875 (halfway between the values expected for second and third degree relatives) have been identified as related. The z0, z1 and z2 metrics indicate the proportion of the genome at which two individuals share 0, 1 or 2 allele copies identical by descent (z0 = 0 copies, z1 = 1 copy and z2 = 2 copies), with PI_HAT calculated as P(IBD=2) + 0.5*P(IBD=1). These metrics can be used to estimate the type of relationship as described in Buckleton, Bright, and Taylor (2018), who suggest the following can be used:

Relationship              z0    z1    z2
Parent/child              0     1     0
Full siblings             1/4   1/2   1/4
Half siblings             1/2   1/2   0
Grandparent/grandchild    1/2   1/2   0
Uncle/nephew              1/2   1/2   0
First cousins             3/4   1/4   0
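As a worked example, the PI_HAT formula given above, P(IBD=2) + 0.5*P(IBD=1), can be applied to the expected z values in the table to see which relationships exceed the 0.1875 threshold. This is a sketch using exact fractions; the variable and function names are illustrative.

```python
from fractions import Fraction


def pi_hat(z1, z2):
    """PI_HAT = P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + Fraction(1, 2) * z1


# Expected (z0, z1, z2) for each relationship, taken from the table above.
EXPECTED = {
    "Parent/child": (Fraction(0), Fraction(1), Fraction(0)),
    "Full siblings": (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)),
    "Half siblings": (Fraction(1, 2), Fraction(1, 2), Fraction(0)),
    "Grandparent/grandchild": (Fraction(1, 2), Fraction(1, 2), Fraction(0)),
    "Uncle/nephew": (Fraction(1, 2), Fraction(1, 2), Fraction(0)),
    "First cousins": (Fraction(3, 4), Fraction(1, 4), Fraction(0)),
}
THRESHOLD = Fraction(3, 16)  # 0.1875

results = {}
for name, (z0, z1, z2) in EXPECTED.items():
    value = pi_hat(z1, z2)
    results[name] = (value, value > THRESHOLD)
    print(f"{name:24s} PI_HAT={float(value):.4f} flagged={value > THRESHOLD}")
```

Under these expected values, first and second degree relationships give a PI_HAT of 0.5 and 0.25 respectively and are flagged, while first cousins (0.125) fall below the threshold. Observed values in real data will scatter around these expectations.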


The Z0 and Z1 metrics are plotted for every pair of individuals:

IBD Z0 vs Z1 plot. Samples that have been identified as related are highlighted in red.


A total of 8 unique samples have been identified as related and are provided below:

This table provides the IBD results, which may identify the same sample as related to many samples, and therefore the table may have more rows than the number of samples identified as related. NOTE: This is an interactive table where related samples can be explored.

Ancestry Estimation

It is important to account for ancestry during a typical GWAS. The data was merged with a subset of the 1000 Genomes Phase 1 data and pruned, and PCA was calculated using PLINK. Principal components 1 (PC1) and 2 (PC2) were plotted, with the 1000 Genomes samples coloured according to their known ancestry. The genotyped samples were then superimposed on this data in black to estimate their ancestry. The ancestry estimation is shown below using PC1 vs PC2:

NOTE: This is an interactive figure showing the estimated ancestry of each sample. Ancestry groups are as follows: AFR: African, AMR: Admixed American, EAS: East Asian, EUR: European.

Additional plots of ancestry can be found in the "9.Additional_QC/6.Ancestry_estimation" folder.


Overview of potential problematic samples

A summary of the samples identified as potentially problematic is shown in the table below:

NOTE: This is an interactive table where samples can be explored to identify potential issues.


Summary of reasons why a sample has been identified as potentially problematic



SNP QC

SNP Call Rates

The SNP call rate is the proportion of samples that have an assigned genotype for a particular SNP. A low SNP call rate can be caused by a number of issues, both technical and biological, and SNPs with a low call rate can lead to potential bias Marees et al. (2018). Problematic SNPs in GenomeStudio were initially assessed and rescued as described in our GenomeStudio QC SOP. The data was then processed using zCall, which attempts to assign genotypes to unassigned calls. The following is a summary of the number of SNPs:

QC                                    SNP count
Genotyping Chip                       547644
After GenomeStudio QC                 545460
After zCall                           545460
*<95% SNP + <98% sample call rate     545019

*The removal of samples with call rate less than 98% was performed for the purpose of additional QC. These samples/SNPs were reintroduced into the final processed data.
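The iterative SNP/sample removal referred to in the last table row can be sketched as follows. The exact iteration scheme used by the pipeline is not described in this report, so this sketch assumes a simple alternation: drop low call rate SNPs, then drop low call rate samples using the remaining SNPs, and repeat until nothing changes. The function names and the toy matrix are hypothetical, and the demonstration uses a 75% threshold only because the matrix is too small for 95%/98% to be meaningful.

```python
def call_rate(values):
    """Proportion of entries that carry a genotype call (None = no call)."""
    return sum(v is not None for v in values) / len(values)


def iterative_filter(matrix, snp_min=0.95, sample_min=0.98):
    """matrix[i][j] is the genotype of sample i at SNP j, None if uncalled.

    Alternately drops low call rate SNPs and samples until stable.
    Returns (kept_sample_indices, kept_snp_indices).
    """
    samples = list(range(len(matrix)))
    snps = list(range(len(matrix[0])))
    while samples and snps:
        kept_snps = [j for j in snps
                     if call_rate([matrix[i][j] for i in samples]) >= snp_min]
        if not kept_snps:
            return samples, []
        kept_samples = [i for i in samples
                        if call_rate([matrix[i][j] for j in kept_snps]) >= sample_min]
        if kept_snps == snps and kept_samples == samples:
            break  # every remaining SNP and sample passes
        snps, samples = kept_snps, kept_samples
    return samples, snps


# Toy data: 4 samples x 4 SNPs, genotypes coded 0/1/2.
toy = [
    [0, 1, None, 2],
    [1, 1, None, 0],
    [2, 0, 1, 1],
    [None, None, None, 1],
]
print(iterative_filter(toy, snp_min=0.75, sample_min=0.75))
# ([0, 1, 2], [0, 1, 3])
```

In the toy run, the third SNP is dropped first; the fourth sample then fails on the remaining SNPs and is dropped, after which a second pass confirms that everything left passes.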

Minor Allele Frequency

The minor allele frequency (MAF) is the frequency of the second most common allele for a particular SNP. Most studies are underpowered to detect associations with low-MAF SNPs, which are therefore often excluded Marees et al. (2018). For the smallest studies, where fewer than 1000 individuals are investigated, a cut-off of 5% should be considered Coleman et al. (2016). The distribution of MAF is shown below:

Distribution of MAF in the data. The typical MAF 0.01 and 0.05 cut-off thresholds are indicated by the red lines. A MAF cut-off of 0.05 would leave 266678 SNPs, while a MAF cut-off of 0.01 would leave 288817 SNPs.

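A minimal sketch of how MAF and the counts at each cut-off could be derived is given below. It assumes biallelic autosomal SNPs with genotypes coded as the number of B alleles (0, 1 or 2) and None for no-calls, and it assumes that a SNP passes when its MAF is at or above the cut-off. The SNP names, data and function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF for one SNP from B-allele counts (0/1/2); None entries are skipped."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    freq_b = sum(called) / (2 * len(called))
    return min(freq_b, 1 - freq_b)


def count_passing(snps, cutoff):
    """Number of SNPs whose MAF is at or above the cut-off."""
    passing = 0
    for genotypes in snps.values():
        maf = minor_allele_frequency(genotypes)
        if maf is not None and maf >= cutoff:
            passing += 1
    return passing


# Toy data for three SNPs.
snps = {
    "snp_a": [0, 0, 1, 2, None],  # B frequency 3/8, MAF 0.375
    "snp_b": [2] * 19 + [1],      # B frequency 39/40, MAF 0.025
    "snp_c": [0] * 20,            # monomorphic, MAF 0
}
print(count_passing(snps, 0.05), count_passing(snps, 0.01))  # 1 2
```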


Content Structure

The data has been arranged in the following folders:

  • 0.Scripts_and_logfiles (Contains the master script and associated logfiles)
  • 1.GenomeStudio_report_file (Contains the processed GenomeStudio output file that was used as input for the Genotyping QC Pipeline)
  • 2.Illumina_manifest_and_allele_update_file (Contains the Illumina manifest and the allele update files in the Illumina TOP strand format)
  • 3.Clinical_gender (Contains the clinical gender information)
  • 4.Prepare_report_file (Contains report file with problematic samples removed)
  • 5.Duplicate_samples (Contains information on duplicate IDs identified and any associated changes)
  • 6.Samples_SNPs_removed (Contains list of samples and SNPs removed before applying zCall)
  • 7.zCall (Contains zCall related files)
  • 8.FINAL_QC_DATA (Contains final processed data after zCall. The data is provided in the Illumina TOP strand. If duplicates existed in this data, then the ".fam_with_dup_ID" contains sample IDs that were changed. This data should be used if imputation was not requested.)
  • 9.Additional_QC (Contains additional processing that was performed on the final processed data to identify additional potential problematic issues.)
    • 1.Low_call_rate_SNP_samples_removed (Contains the PLINK binary file when SNPs with a call rate below 95% and samples with call rate below 98% are removed. The "08.zcall_final_low_snp_sample_removed.mindrem.id" file lists the samples removed)
    • 2.Pruned_data (Contains the pruned data)
    • 3.Sex_check (Contains gender check information and plots. The "08.gender_missmatches.txt" file lists all samples with gender mismatches)
    • 4.Heterozygosity_test (Contains heterozygosity test related files and plots. The "08.zcall_final_highLD_and_nonautosomal_removed.het.LD_het_outliers_sample_exclude.txt" file lists all samples identified as outliers)
    • 5.Idenity-by-Descent (Contains IBD related files and plots. The "08.IBD_outliers.txt" file contains samples identified as related)
    • 6.Ancestry_estimation (Contains files and plots relating to ancestry check)
    • 7.Call_rate_plots (Contains sample call rate plots after QC)
    • 8.MAF_plot_and_snp_summary (Contains MAF plots and summary on number of SNPs)
    • 9.Compiled_list_of_potential_problematic_samples (Contains list of samples that have been identified as outliers.)
  • 10.Imputation (This folder will only be available if data imputation was requested and should be used for further analysis.)
    • 1.Pre-imputation_QC (Contains processing files for SNP and sample removal)
    • 2.Pre-imputation_checks_against_ref (Contains files relating to checks with reference sequence - performed using the McCarthy tool)
    • 3.Data_to_impute (Contains files to be uploaded to the Michigan Server)
    • 4.Raw_imputed_data (Contains the files downloaded from the Michigan Server)
    • 5.Post_imputation_checks (Contains files relating to the post imputation checks - performed using the McCarthy IC tool)
      • 1.IC_input (McCarthy IC tool input files)
      • 2.IC_output (McCarthy IC tool output files)
      • summary_report.html (McCarthy IC tool summary report)

Further Analysis

The data provided has been processed through GenomeStudio and zCall. SNPs that were deemed unreliable in GenomeStudio and samples that had a call rate below 95% were removed. The list of samples removed (if any) can be found in the Sample Call Rate section and in the "6.Samples_SNPs_removed" folder. The remaining SNPs and samples were further processed using zCall to create the "final processed data" located in the "8.FINAL_QC_DATA" folder. Additional SNPs and samples (SNP call rate <95% and sample call rate <98%) were removed only to calculate gender mismatches, heterozygosity, related samples, and ancestry. These SNPs and samples have been reintroduced in the "final processed data", but the affected samples are listed in the Overview of potential problematic samples section. If performing a typical GWAS, these samples should be considered for removal. A recommended GWAS guide is Coleman et al. (2016), which also contains a GWAS codebook.
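For users who decide to exclude flagged samples, the following sketch shows one way to combine the per-check lists into a single list with the reasons attached, and to format it as a two-column family ID / individual ID list of the kind that PLINK's --remove option accepts. The sample IDs, reason labels and function names are hypothetical; the actual lists are those provided in the "9.Additional_QC" folder, and whether each flagged sample should be removed remains a study-specific decision.

```python
def compile_problem_samples(flagged):
    """flagged: reason -> iterable of (FID, IID) pairs.

    Returns a dict mapping each (FID, IID) to the list of reasons
    for which that sample was flagged.
    """
    reasons = {}
    for reason, samples in flagged.items():
        for key in samples:
            reasons.setdefault(key, []).append(reason)
    return reasons


def to_remove_lines(reasons):
    """Format samples as tab-separated 'FID IID' lines, one per sample."""
    return [f"{fid}\t{iid}" for fid, iid in sorted(reasons)]


# Hypothetical example input.
flagged = {
    "sex_discrepancy": [("F1", "S3")],
    "heterozygosity": [("F1", "S3"), ("F2", "S7")],
    "related": [("F4", "S9")],
}
reasons = compile_problem_samples(flagged)
for line in to_remove_lines(reasons):
    print(line)
```

Keeping the reasons alongside each sample makes it easier to apply different rules per check, for example removing only one member of each related pair.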


Contacts

For any queries regarding COPILOT, please contact:

  • Dr Hamel Patel (hamel.patel@kcl.ac.uk)


References

Buckleton, John S., Jo-Anne Bright, and Duncan Taylor. 2018. Forensic DNA Evidence Interpretation. doi:10.4324/9781315371115.

Coleman, Jonathan R.I., Jack Euesden, Hamel Patel, Amos A Folarin, Stephen Newhouse, and Gerome Breen. 2016. “Quality control, imputation and analysis of genome-wide genotyping data from the Illumina HumanCoreExome microarray.” Briefings in Functional Genomics 15 (4): 298–304. doi:10.1093/bfgp/elv037.

Goldstein, Jacqueline I, Andrew Crenshaw, Jason Carey, George B Grant, Jared Maguire, Menachem Fromer, Colm O’Dushlaine, et al. 2012. “zCall: a rare variant caller for array-based genotyping.” Bioinformatics 28 (19): 2543–5. doi:10.1093/bioinformatics/bts479.

Laurie, Cathy C., Kimberly F. Doheny, Daniel B. Mirel, Elizabeth W. Pugh, Laura J. Bierut, Tushar Bhangale, Frederick Boehm, et al. 2010. “Quality control and quality assurance in genotypic data for genome-wide association studies.” Genetic Epidemiology 34 (6). NIH Public Access: 591–602. doi:10.1002/gepi.20516.

Marees, Andries T., Hilde de Kluiver, Sven Stringer, Florence Vorspan, Emmanuel Curis, Cynthia Marie-Claire, and Eske M. Derks. 2018. “A tutorial on conducting genome-wide association studies: Quality control and statistical analysis.” International Journal of Methods in Psychiatric Research 27 (2). John Wiley & Sons Ltd. doi:10.1002/mpr.1608.

Turner, Stephen, Loren L. Armstrong, Yuki Bradford, Christopher S. Carlson, Dana C. Crawford, Andrew T. Crenshaw, Mariza de Andrade, et al. 2011. “Quality control procedures for genome-wide association studies.” Current Protocols in Human Genetics Chapter 1 (Suppl. 68). Blackwell Publishing Inc.: Unit 1.19. doi:10.1002/0471142905.hg0119s68.