COPILOT: Containerised Workflow for Processing Illumina Genotyping Data

Genotyping microarrays remain a popular tool for exploring genetic variants such as single nucleotide polymorphisms (SNPs) and large structural changes in DNA. Illumina genotyping arrays accomplish this using pre-defined oligonucleotide probes designed to hybridise to specific regions of genomic DNA, followed by extension with chemically labelled nucleotides. The extended probe binds either red or green fluorescent agents, and the resulting signal is interpreted by the Illumina-specific software GenomeStudio. This software determines the identity of alleles by automatically clustering samples based on the similarity of their fluorescence intensities. However, the default clustering algorithm can fail to identify valid clusters and can assign the wrong genotype to samples with abnormal intensity patterns. This can be addressed by manually reviewing and re-calling SNPs to increase the reliability, confidence and overall quality of the data (including SNP and sample call rates), making this a crucial quality control (QC) procedure prior to further QC using PLINK or genetic interpretation.
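As an illustration of the SNP and sample call rates mentioned above, the following minimal sketch computes both from a small genotype matrix. The 0/1/2 genotype coding, the use of `None` for a No Call, and the function name `call_rates` are assumptions made for this example; they are not taken from GenomeStudio or COPILOT.

```python
# Minimal sketch: sample and SNP call rates from a genotype matrix.
# Assumed coding (hypothetical, for illustration only):
#   0, 1, 2 = number of copies of one allele; None = No Call.

def call_rates(genotypes):
    """Return (sample_rates, snp_rates) for a list of per-sample genotype rows."""
    n_samples = len(genotypes)
    n_snps = len(genotypes[0])
    # Sample call rate: fraction of SNPs with a called genotype in each sample.
    sample_rates = [
        sum(g is not None for g in row) / n_snps for row in genotypes
    ]
    # SNP call rate: fraction of samples with a called genotype at each SNP.
    snp_rates = [
        sum(row[j] is not None for row in genotypes) / n_samples
        for j in range(n_snps)
    ]
    return sample_rates, snp_rates


# Three samples (rows) genotyped at four SNPs (columns).
example = [
    [0, 1, None, 2],
    [0, None, None, 2],
    [1, 1, 2, 2],
]
sample_rates, snp_rates = call_rates(example)
print(sample_rates)
print(snp_rates)
```

Re-calling a SNP that was previously a No Call in some samples raises both the call rate of that SNP and the call rates of the affected samples.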

Furthermore, GenomeStudio’s GenCall algorithm is optimised for clustering and calling common variants, and often reports rare variants as a No Call. The data is then generally processed by a series of complex bioinformatics analyses, which depend on a variety of software packages written in different programming languages, each with dependencies that must be installed and configured correctly. This can be a daunting task for novice bioinformaticians.

Here we introduce COPILOT, a Containerised wOrkflow for Processing ILlumina genOtyping daTa. The automated pipeline consists of a series of bash, C/C++, R, and Python programs containerised using the Docker framework, ready to execute on multiple operating platforms with minimal effort from the user. The pipeline takes the output from GenomeStudio, processes the data using the zCall rare variant calling algorithm, and applies a number of analyses, including identification of discrepancies between genotypic and phenotypic gender, calculation of identity-by-descent (IBD), heterozygosity testing, and estimation of sample ancestry based on the 1000 Genomes reference panel. The output from the pipeline includes the processed data accompanied by a detailed interactive summary report with informative plots and explanations to guide the user through the next steps of analysis.
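To make the heterozygosity test concrete, the sketch below shows one common way such a check is done: compute each sample's heterozygosity rate and flag samples that lie more than a chosen number of standard deviations from the cohort mean. This is a sketch under stated assumptions, not COPILOT's implementation; the genotype coding, the function names, and the `n_sd` threshold are all hypothetical.

```python
# Minimal sketch of a per-sample heterozygosity outlier check.
# Assumed coding (hypothetical): 0/2 = homozygous, 1 = heterozygous,
# None = No Call. The n_sd threshold is an illustrative choice.
from statistics import mean, stdev


def heterozygosity_rate(row):
    """Fraction of called genotypes in one sample that are heterozygous."""
    called = [g for g in row if g is not None]
    if not called:
        return None  # no called genotypes, rate is undefined
    return sum(g == 1 for g in called) / len(called)


def flag_het_outliers(rates, n_sd=3.0):
    """Flag samples whose rate is more than n_sd standard deviations from the mean."""
    centre = mean(rates)
    spread = stdev(rates)  # sample standard deviation, needs at least 2 values
    return [abs(r - centre) > n_sd * spread for r in rates]


# Nine samples with typical heterozygosity and one with excess heterozygosity.
rates = [0.30] * 9 + [0.60]
# A lower threshold is used here because the toy cohort is very small.
flags = flag_het_outliers(rates, n_sd=2.0)
print(flags)
```

Unusually high heterozygosity can indicate sample contamination, while unusually low heterozygosity can indicate inbreeding, so flagged samples are usually reviewed before downstream analysis.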

We include a thorough GenomeStudio QC guide, which can be used to process raw genotype data in GenomeStudio and create the data format required by COPILOT. We also include a detailed user guide for executing COPILOT, complete with real example genotype data and an example of the output generated by COPILOT.