SG-ADVISER will send the user an email once results are available. Please note that variant files submitted to the SG-ADVISER server will be destroyed after successful completion of variant annotation. Annotation files will be destroyed 30 days after their generation. Annotations are provided as a tab-delimited flat file. The output file can be manipulated with our UI tool. Download the UI. Variants are annotated at the transcript level and presented as a single line per variant - thus any column containing annotations relevant to multiple transcripts will be further subdivided by triple forward slashes ("///"). When an annotation is not applicable to a variant or transcript, the null value is represented by a "-" character, often repeated in the format of the column. For example, a column whose entries are formatted as "Value1~Value2" will, if null, receive a value of "-~-". An example of an annotated file can be found here. The specific source and format of each column is presented below. The full citations for references listed in this section can be found in References.
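The delimiter conventions described above can be illustrated with a minimal Python sketch. It assumes only what is stated here: "///" separates transcripts, "~" separates the fields of a formatted entry, and "-" marks a null value. The function name parse_cell is hypothetical and is not part of SG-ADVISER or its UI tool.

```python
TRANSCRIPT_SEP = "///"  # separates annotations belonging to different transcripts
FIELD_SEP = "~"         # separates fields inside one formatted entry
NULL = "-"              # null marker described in this section


def parse_cell(cell):
    """Split one annotation cell into a list (per transcript) of field lists.

    Fields that are exactly "-" are returned as None.
    """
    transcripts = []
    for chunk in cell.split(TRANSCRIPT_SEP):
        fields = [None if field == NULL else field for field in chunk.split(FIELD_SEP)]
        transcripts.append(fields)
    return transcripts
```

For instance, parse_cell("12~340///-~-") yields one two-field entry for the first transcript and two null fields for the second. Columns that use additional delimiters (such as "$" or ",") would need a further split.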
Haplotype - Placeholder for phased genomes. May contain any user derived value.
Chromosome - Chromosome containing the variant in "chr#" format where # is 1 - 22 or X or Y.
Begin - Physical start position of the variant. 0-based coordinates. Coordinates correspond to hg19.
End - Physical end position of the variant. 0-based coordinates. Coordinates correspond to hg19.
VarType - Variant type (e.g. 'snp', 'ins', 'del', 'delins').
Reference - Allele in the reference genome (hg19).
Allele - Reported alternative allele derived from reference-based variant calling.
Notes - Any free text to be carried over to the annotation file. Examples include genotypes, quality scores, etc.
Gene - The transcript(s) nearest to the variant by physical distance. Gene models are derived from the UCSC genome browser known genes track (Meyer et al. 2012). The HUGO gene symbol is provided for each gene, followed by the UCSC transcript ID in parentheses. Format: Gene_Symbol(UCSC_transcript_ID)
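As a rough illustration of the Gene_Symbol(UCSC_transcript_ID) format, the sketch below splits a Gene cell into (symbol, transcript ID) pairs. The function name and the example identifiers are made up for illustration, and the regular expression assumes that neither the symbol nor the transcript ID contains parentheses.

```python
import re

# Assumed shape of one entry: Gene_Symbol(UCSC_transcript_ID)
GENE_ENTRY = re.compile(r"^(?P<symbol>[^()]+)\((?P<transcript>[^()]+)\)$")


def parse_gene_column(cell):
    """Return one (symbol, transcript_id) pair per "///"-delimited transcript.

    Entries that are null ("-") or do not match the assumed format
    are returned as (None, None).
    """
    pairs = []
    for entry in cell.split("///"):
        match = GENE_ENTRY.match(entry.strip())
        if match is None:
            pairs.append((None, None))
        else:
            pairs.append((match.group("symbol"), match.group("transcript")))
    return pairs
```

Keeping one pair per transcript preserves the ordering, so the result can be lined up with other transcript-specific columns on the same row.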
Gene_Type - The transcript type. Possible values are "Protein_Coding" or "Noncoding_RNA".
Location - Location of the variant relative to the nearest transcript(s). Exons and introns are numbered in the direction of the reading frame. Multiple nucleotide substitutions may span multiple locations (e.g. Exon_6-Intron_6). Potential values are "Upstream", "Downstream", "5UTR" (5' untranslated region), "3UTR" (3' untranslated region), Exon_# (where # is the coding exon number), Intron_# (includes introns flanking coding and non-coding exons), and noncoding_rna for variants landing in non-intronic noncoding RNA sites.
Distance - Absolute physical distance from nearest transcript. Shortest distance from transcription start or stop site is calculated. All variants within the transcription start and end site receive a value of "0".
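The Distance rule can be sketched as follows. This is only an illustration of the description above, not the SG-ADVISER implementation: the function name is hypothetical, and both the variant and the transcript are treated as closed intervals, which is an assumption because the interval convention is not stated here.

```python
def distance_to_transcript(var_begin, var_end, tx_start, tx_end):
    """Absolute distance from a variant to a transcript.

    Returns 0 when the variant overlaps the span between the transcription
    start and end sites, otherwise the gap to the nearer of the two sites.
    Intervals are assumed to be closed (an assumption of this sketch).
    """
    if var_end < tx_start:       # variant lies entirely before the transcript
        return tx_start - var_end
    if var_begin > tx_end:       # variant lies entirely after the transcript
        return var_begin - tx_end
    return 0                     # variant falls within the transcript span
```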
Coding_Impact - The impact of a variant on a protein coding transcript(s). Multiple transcripts are delimited by "///". It is possible for transcripts with the same gene symbol to receive different values. Potential values include:
Protein_Pos - The position(s) of the variant within the amino acid sequence of a protein coding gene.
Original_AA - The reference amino acid(s) at the position(s) impacted by the variant. Presented as single letter IUPAC codes.
Allele_AA - The variant amino acid(s) at the position(s) impacted by the variant. Presented as single letter IUPAC codes.
Start~Stop_Dist - For variants within a protein-coding reading frame, the physical distance of variants from the start and stop codon. Formatted as: distance_from_start_codon~distance_from_stop_codon.
Columns 18-20 apply to truncating (frameshift or nonsense) variants only. Possible downstream alternative start sites are considered in these calculations - thus Prop_Cons_Affected_Upstream and Prop_Cons_Affected_Downstream do not necessarily sum to 1.0.
Prop_Cons_Affected_Upstream - The fraction of the conserved portion of the protein coding sequence removed by a truncating variant. Note, an amino acid is considered to be part of a conserved portion if it lies within a conserved element (see below). Each individual amino acid need not be conserved.
Prop_Cons_Affected_Downstream - The fraction of the conserved portion of the protein coding sequence removed by a truncating variant.
Trunc_Prediction - A summary prediction of the functional impact of a truncating variant. If more than approximately 4% of the conserved portion of a protein is removed by a truncating variant, it is considered "Damaging_Truncation". Cutoff derived from Hu and Ng 2012.
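The cutoff described for Trunc_Prediction amounts to a simple comparison, sketched below. The function name is hypothetical, the 4% cutoff is applied as a strict inequality (the text gives it only approximately), and returning None for variants below the cutoff is an assumption because the label used for that case is not given here.

```python
DAMAGING_CUTOFF = 0.04  # approximately 4% of the conserved portion (Hu and Ng 2012)


def trunc_prediction(prop_cons_removed):
    """Return "Damaging_Truncation" if more than the cutoff fraction is removed.

    prop_cons_removed is a fraction between 0 and 1, or None when the
    column does not apply. None is returned in all other cases
    (an assumption of this sketch).
    """
    if prop_cons_removed is None:
        return None
    if prop_cons_removed > DAMAGING_CUTOFF:
        return "Damaging_Truncation"
    return None
```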
ConservedXX - Conserved elements and nucleotide specific conservation levels. "XX" in the column header (e.g. 46way) corresponds to the species considered for conservation analysis. Conserved elements are determined by PhastCons (Siepel et al. 2005). Nucleotide specific conservation level is determined by PhyloP (Pollard et al. 2010). Score is presented as PhastCons-Score~PhyloP-Score at the same species depth of conservation. It is sufficient for a conserved element to be identified at a single species depth for a gene region to be considered as part of a conserved element in the above Trunc_Prediction.
XXX_minallele - Allele frequency of HapMap SNPs from HapMap populations (International HapMap Consortium 2007). Denominator for allele frequency is 1000. Eleven populations are included: ASW, CEU, CHB, CHD, GIH, JPT, LWK, MEX, MKK, TSI, YRI.
1000GENOMES_AF - Allele frequency of variants identified in the 1000 Genomes Project (1000 Genomes Project Consortium, 2012). Denominator for allele frequency is 1.00.
WELLDERLY_AF325 - Allele frequency of variants identified in the Scripps Wellderly Genomes. Denominator for allele frequency is 1.00.
NHLBI - Allele frequency of variants identified in the NHLBI exomes. Denominator for allele frequency is 1.00.
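Because the HapMap columns and the other frequency columns use different denominators, it can be convenient to bring them onto a common 0-1 scale before comparing them. The sketch below reads "denominator is 1000" as meaning that a reported value of 250 corresponds to a frequency of 0.25; that reading, the function name, and the source keys are assumptions of this example.

```python
# Denominators as stated for each frequency column (keys are illustrative labels).
DENOMINATORS = {
    "hapmap": 1000.0,      # XXX_minallele columns
    "1000genomes": 1.0,    # 1000GENOMES_AF
    "wellderly": 1.0,      # WELLDERLY_AF325
    "nhlbi": 1.0,          # NHLBI
}


def to_fraction(raw_value, source):
    """Convert a raw frequency string to a 0-1 fraction, or None if null."""
    if raw_value == "-":
        return None
    return float(raw_value) / DENOMINATORS[source]
```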
eQTL_genes - Genes whose expression is known to be statistically associated with the presence of the variant. Data derived from the NCBI GTEx eQTL Browser.
miRNA_BS_influenced - Applicable if a variant maps to the 3'UTR of a gene. If a variant maps anywhere within the 3'UTR of a transcript, TargetScan (Lewis et al. 2005) is utilized to compare the predicted binding sites of all microRNAs, derived from miRBase (Griffiths-Jones et al. 2006), of the reference and alternate 3'UTR sequence. This comparison results in a prediction of created or destroyed microRNA binding sites. Note the variant does not need to directly fall within a predicted microRNA binding site to appear in this column. The microRNA binding site created or destroyed may appear anywhere within the affected transcripts' 3'UTR. Output is a list of microRNA names. Multiple influenced microRNAs within the same transcript 3'UTR are separated by '~', transcripts separated by '///'.
miRNA_BS_impact - The variant impact on the microRNA listed in miRNA_BS_influenced. Values are CREATED or DESTROYED as predicted by TargetScan. Order corresponds to order as presented in miRNA_BS_influenced.
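Since miRNA_BS_impact follows the order of miRNA_BS_influenced, the two columns can be read together. The sketch below pairs each microRNA with its predicted impact; it assumes that miRNA_BS_impact uses the same "~" and "///" delimiters as miRNA_BS_influenced, which is not stated explicitly above. The function name and the microRNA names in the usage note are illustrative only.

```python
def pair_mirna_impacts(influenced_cell, impact_cell):
    """Pair microRNA names with CREATED/DESTROYED calls, per transcript.

    Returns one list per transcript; a null ("-") transcript entry
    gives an empty list.
    """
    paired = []
    transcripts = zip(influenced_cell.split("///"), impact_cell.split("///"))
    for names, impacts in transcripts:
        if names == "-":
            paired.append([])
            continue
        paired.append(list(zip(names.split("~"), impacts.split("~"))))
    return paired
```

For example, the cells "miR-21~miR-155///-" and "CREATED~DESTROYED///-" would give two (name, impact) pairs for the first transcript and an empty list for the second.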
miRNA_BS_direct - List of microRNAs whose predicted binding site (predicted by TargetScan on the reference 3'UTR) directly houses the variant. The physical location of the microRNA binding site and the variant must directly overlap in this case. Output is a list of microRNA names. Multiple influenced microRNAs within the same transcript 3'UTR are separated by '~', transcripts separated by '///'.
miRNA_BS_deltaG - The calculated variant impact on the binding strength of the microRNA listed in miRNA_BS_direct. Values are the ΔΔG free energy - or ΔG calculated for the reference 3'UTR sequence bound to the microRNA seed region minus ΔG calculated for the variant 3'UTR sequence bound to the microRNA seed region. ΔG values calculated by the Vienna RNA package using RNAcofold (Bernhart et al. 2006).
miRNA_genomic - List of microRNAs whose non-coding pre-miRNA reading frame within the genome houses the variant. Multiple microRNAs are separated by '///'. Note the different microRNAs listed here have no assumed relationship with the nearest gene nor does the order of presentation have any bearing on the order of presentation of the nearest transcript.
miRNA_folding_deltaG - ΔΔG change in free energy of folding of the pre-miRNA bearing the variant - or ΔG free energy of the folded reference pre-miRNA sequence minus ΔG free energy of folded variant pre-miRNA sequence. Folding energy is calculated using the Vienna RNA package using the RNAfold algorithm (Hofacker and Stadler 2006). Ordered as presented in miRNA_genomic.
miRNA_binding_deltaG - Average ΔΔG change in free energy of binding for pre-miRNA to its predicted targets - or ΔG free energy of the binding of the reference pre-miRNA sequence to its targets minus ΔG free energy of binding of the variant pre-miRNA to its predicted targets. Binding energy is calculated using the Vienna RNA package using the RNAcofold algorithm. Ordered as presented in miRNA_genomic, separated by "///".
miRNA_top_targets_changed - Top five gene targets with the largest change in ΔΔG free energy of binding and their corresponding ΔΔG values. Values corresponding to different microRNAs separated by "///", genes separated by ",", gene and ΔΔG value separated by "~".
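The three nested delimiters of miRNA_top_targets_changed can be unpacked as in the sketch below. The function name and gene names are hypothetical, and treating "-" or "-~-" as the null entry for a microRNA is an assumption based on the general null convention of this file.

```python
def parse_top_targets(cell):
    """Return, per microRNA, a list of (gene, delta_delta_g) tuples."""
    per_mirna = []
    for chunk in cell.split("///"):          # one chunk per microRNA
        targets = []
        if chunk not in ("-", "-~-"):        # assumed null representations
            for item in chunk.split(","):    # one item per target gene
                gene, ddg = item.rsplit("~", 1)
                targets.append((gene, float(ddg)))
        per_mirna.append(targets)
    return per_mirna
```

Splitting on the last "~" keeps the sketch robust if a gene name were ever to contain that character.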
Splice_Site_Pred - Determination as to whether the variant impacts the canonical splice donor or acceptor site of the impacted transcript. Possible values are Splice_Site_Donor_Damaged, Splice_Site_Acceptor_Damaged, or "-".
Splicing_Prediction(MaxENT) - Prediction as to whether a variant near an exon-intron junction influences splicing. Prediction calculated by the MaxEntScan algorithm (Yeo and Burge 2004). Output is presented as Result_of_splice_site_Prediction~reference_maximum_entropy_score&variant_maximum_entropy_score. Potential values are: Unconventional Splice Site (when the MaxENT score of the reference sequence is negative), Splicing_Change (when the MaxENT score of the reference sequence is positive and that of the variant sequence is negative), or No_Splicing_Change (when both the reference and variant MaxENT scores are positive).
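The three-way rule for this column can be written out as a short sketch. The scores themselves come from MaxEntScan and are not computed here; the function names are hypothetical, and a score of exactly zero is treated as positive, which is an assumption since that case is not described.

```python
def classify_splicing(ref_score, var_score):
    """Apply the reference/variant sign rule described for this column."""
    if ref_score < 0:
        return "Unconventional Splice Site"
    if var_score < 0:
        return "Splicing_Change"
    return "No_Splicing_Change"


def format_maxent(ref_score, var_score):
    """Build a value in the Result~reference_score&variant_score layout."""
    label = classify_splicing(ref_score, var_score)
    return f"{label}~{ref_score}&{var_score}"
```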
ESE_sites - Total number of exonic splicing enhancer motif sequences CREATED or DELETED by the variant. ESE motifs derived from Stadler et al. 2006. Format: # site(s) CREATED/DELETED, where # is the total number of motifs gained (CREATED) or lost (DELETED).
ESS_sites - Total number of exonic splicing silencer motif sequences CREATED or DELETED by the variant. ESS motifs derived from Stadler et al. 2006. Formatted in the same manner as ESE_sites.
Protein_Impact_Prediction(Polyphen) - Prediction of the functional impact of nonsynonymous variants by the Polyphen-2 algorithm (Adzhubei et al. 2013). Values = probably damaging, possibly damaging, benign, and unknown.
Protein_Impact_Probability(Polyphen) - Probability that the nonsynonymous variant is deleterious by the Polyphen-2 algorithm.
Protein_Impact_Prediction(SIFT) - Prediction of the functional impact of nonsynonymous variants by the SIFT algorithm (Kumar et al. 2009). Values = TOLERANT, INTOLERANT, and N/A.
Protein_Impact_Score(SIFT) - Conservation score as determined by the SIFT algorithm. Scores < 0.05 are INTOLERANT.
Protein_Domains - Significant PFAM (Finn et al. 2008) protein domains of the impacted protein-coding transcript as determined by InterProScan (Zdobnov and Apweiler 2001). Transcripts separated by "///", domains in the same transcript separated by "$".
Protein_Domains_Impact(LogRE) - For all variants impacting the coding sequence (nonsynonymous, frameshift, in-frame, etc.) a Log Ratio E-value score is calculated for all protein domains bearing the variant. The LogR.E-value score is the log ratio of the PFAM hidden Markov model E-value match of the variant amino acid sequence over the match to the reference amino acid sequence (Clifford et al. 2004). A LogR.E-value greater than 0.7 is predicted damaging. PFAM matching performed by HMMER (Eddy 2009). Presented as PFAM_ID~Score. Transcripts separated by "///", domains in the same transcript separated by "$".
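The LogR.E-value calculation can be sketched as below, given E-values already obtained from a PFAM match of the variant and reference sequences. The base of the logarithm is not stated in this section; base 10 is used here purely as an assumption, and the function names are hypothetical.

```python
import math


def log_ratio_evalue(variant_evalue, reference_evalue):
    """Log ratio of the variant E-value over the reference E-value.

    Base 10 is an assumption of this sketch.
    """
    return math.log10(variant_evalue / reference_evalue)


def is_domain_damaged(variant_evalue, reference_evalue, cutoff=0.7):
    """Apply the 0.7 cutoff described for this column."""
    return log_ratio_evalue(variant_evalue, reference_evalue) > cutoff
```

A variant sequence that matches the domain model less well has a larger E-value than the reference, which gives a positive score.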
Protein_Impact_Prediction(Condel) - Prediction of the functional impact of nonsynonymous variants by the Condel algorithm (González-Pérez and López-Bigas 2011). Values = deleterious, neutral, N/A.
Protein_Impact_Score(Condel) - Probability that a nonsynonymous variant is deleterious by the Condel algorithm.
TF_Binding_Sites - Predicted transcription factor binding sites (TFBS) impacted by the variant. PLUS or MINUS depicts the DNA strand on which the transcription factor binding site lies. Separate impacted sites are delimited by "///". TFBS are not related to annotated transcripts. Predicted TFBS are pre-computed by using the human transcription factors listed in the JASPAR and TRANSFAC transcription factor binding profile databases to scan the human genome with the MOODS algorithm (Wasserman and Sandelin 2004, Wingender 1996, Korhonen 2009). The probability that a site corresponds to a TFBS is calculated by MOODS based on the background distribution of nucleotides in the human genome. TFBS are called at a relaxed threshold (p-value < 1∙10^-6) in conserved, hypersensitive, or promoter regions, and at a more stringent threshold (p-value < 1∙10^-8) for all other locations, in order to capture sites that are more likely to correspond to true functional TFBS. Conserved and hypersensitive sites correspond to the phastCons conserved elements and ENCODE DNase hypersensitive sites annotated in the UCSC genome browser, while promoters correspond to 2kb upstream of known gene transcription start sites, promoter regions annotated by TRANSPro, and transcription start sites identified by SwitchGear Genomics ENCODE tracks.
TFBS_deltaS - For each TFBS listed above, the potential functional impact of the variant is scored by calculating the difference between the variant and original sequence scores using the position weight matrix method described in Stormo 2000 and shown to identify regulatory variants in Andersen et al. 2008. A TFBS is suggested to be considered damaged if it is deleted (completely removed by a deletion) or has a delta score of less than -7.0.
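To make the delta score concrete, the sketch below scores a reference and a variant site against a position weight matrix and takes the difference (variant minus reference). It is a generic log-odds illustration, not the SG-ADVISER pipeline: the toy matrix, the uniform background, and the function names are all assumptions, and it handles only substitutions that leave the site length unchanged.

```python
import math

# Uniform background is a simplification; the text above refers to the
# background nucleotide distribution of the human genome.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Toy 3-position matrix of base probabilities, invented for illustration.
TOY_PWM = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
]


def pwm_score(sequence, pwm):
    """Sum of log2(probability / background) over the positions of the site."""
    return sum(
        math.log2(column[base] / BACKGROUND[base])
        for base, column in zip(sequence, pwm)
    )


def delta_score(ref_seq, var_seq, pwm):
    """Variant score minus reference score; negative values weaken the match."""
    return pwm_score(var_seq, pwm) - pwm_score(ref_seq, pwm)
```

With this toy matrix, changing the strongly preferred middle base of "ACG" to "T" gives a negative delta score, which is the direction the -7.0 threshold above is concerned with.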
omimGene_ID~omimGene_association - OMIM gene ID and associated phenotype if any (McKusick 1998). Presented as OMIM_ID~Phenotype. Transcripts delimited by "///".
Protein_Domain_Gene_Ontology - Gene Ontology (Ashburner et al. 2000) annotations assigned to Protein_Domains detected by InterProScan. Transcript specific - i.e. all gene ontology annotations per protein domain in each individual transcript are combined.
dbSNP_ID - dbSNP identifiers for previously observed variants (Sherry et al. 1999).
HGMD_Variant~PubMedID - Determination as to whether the variant has been previously associated with disease and deposited in the Human Gene Mutation Database (HGMD) (Krawczak et al. 2000). Format: phenotype~PubMedID.
HGMD_Gene~disease_association - List of phenotypes associated in HGMD with the gene(s) nearest to the variant. Different phenotypes for the same transcript delimited by "$", transcripts delimited by "///".
Genetic_Association_Database~PubMedID - Determination as to whether the variant is entered as a risk variant in the NCBI Genetic Association Database (Becker et al. 2004) and associated PubMedID of publication. Format: phenotype~PubMedID. Not transcript specific. Multiple phenotypes separated by "///".
PharmGKB_Database~Drug - Determination as to whether the variant is entered as a risk variant in PharmGKB (Thorn et al. 2010) and associated drug whose metabolism, efficacy, etc. is affected. Not transcript specific. Multiple phenotypes separated by "///".
Inheritance~Penetrance - Inheritance pattern and penetrance of disease-causing variants as curated in GET Evidence. Transcript specific.
Severity~Treatability - Severity and treatability of disease-causing variants as curated in GET Evidence. Transcript specific.
COSMIC_Variant~NumSamples - Number of times the variant has been observed in cancer samples as catalogued by COSMIC (Bamford et al. 2004). Not transcript specific. Format: cancer_type~number_of_observations. Multiple tumor types separated by "$".
COSMIC_Gene~NumSamples - Number of times the gene(s) impacted by the variant have been observed mutated in cancer samples. Transcript specific. Format: cancer_type~number_of_observations. Multiple tumor types separated by "$". Transcripts separated by "///".
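The cancer_type~number_of_observations layout of the COSMIC columns can be turned into per-transcript counts as sketched below. The function name and tumor type labels are hypothetical, and the handling of null entries ("-" or "-~-") follows the general null convention of this file rather than anything stated for these columns specifically.

```python
def parse_cosmic(cell):
    """Return one {cancer_type: observation_count} dict per transcript."""
    per_transcript = []
    for chunk in cell.split("///"):              # one chunk per transcript
        counts = {}
        for item in chunk.split("$"):            # one item per tumor type
            cancer_type, _, number = item.rpartition("~")
            if cancer_type not in ("", "-"):     # skip null entries
                counts[cancer_type] = int(number)
        per_transcript.append(counts)
    return per_transcript
```

For the variant-level column, which is not transcript specific, the result is simply a one-element list.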
MSKCC_CancerGenes - Determination as to whether the impacted gene(s) are considered cancer genes as catalogued by the Memorial Sloan Kettering Cancer Center (Higgins et al. 2007). Potential values: Tumor Suppressor or Oncogene.
Atlas_Oncology - Determination as to whether the impacted gene(s) are considered cancer genes as catalogued by Atlas Oncology.
Sanger_CancerGenes - Determination as to whether the impacted gene(s) are considered somatic cancer genes as catalogued by the Sanger Cancer Gene Census (Futreal et al. 2004). Format: cancer type name. Multiple cancer types separated by "$".
Sanger_Germline_CancerGenes - Determination as to whether the impacted gene(s) are considered germline cancer genes as catalogued by the Sanger Cancer Gene Census. Format: cancer type name. Multiple cancer types separated by "$".
Sanger_network-informed_CancerGenes~Pval - Significant cancer genes imputed by network connectivity to known cancer genes. Manuscript under preparation. Format: gene_name~p-value.
SegDup_Region - Determination as to whether the variant falls in a segmental duplication region as annotated by the UCSC genome browser. Format: Seg_Dup or "-". We suggest caution in interpreting variants falling in segmental duplication regions due to the high probability of mismapping of reads.
Gene_Symbol - Repeat of gene symbol for convenience.
DrugBank - DrugBank ID (Wishart et al. 2006) of compounds known to target the impacted gene(s).
Reactome_Pathway - Biological pathway to which the impacted gene(s) belong as annotated by Reactome (Joshi-Tope et al. 2005). Multiple pathways separated by "~". Transcripts separated by "///".
Gene_Ontology - Gene Ontology (Ashburner et al. 2000) annotations for impacted gene(s). Multiple terms separated by "$". Transcripts separated by "///".
Disease_Ontology - Disease Ontology (Osborne et al. 2009) annotations for impacted gene(s). Multiple terms separated by "$". Transcripts separated by "///".
ADVISER_Score_Clinical~Disease_Entry~Explanation - Modified American College of Medical Genetics summary categorization for variants residing in genes previously causally associated with disease. See ADVISER Scoring for details.
ADVISER_Score_Research~Disease_Entry~Explanation - Modified American College of Medical Genetics summary categorization for variants residing in genes previously causally associated with disease or associated with elevated disease risk. See ADVISER Scoring for details.