SIFT

I.	INSTALLATION

	A.	Getting the executables
		In bin/ , there are two folders, one named "linux" and one
		named "solaris".
		Go to the folder that corresponds to your operating system
		('cd solaris' or 'cd linux') and type:
 
		mv * ..
 
		This moves all executables in the solaris (linux) folder to
		the parent directory, so that all of the executables are now
		in sift/bin/
 
		If you have neither a solaris nor linux platform, go to src/
		and follow README directions to compile the source code.  
		Once compilation is complete, move the executables to 
		sift/bin. 
                                                                          
	B.	Setting paths
	
		In bin/SIFT.csh, set the paths for  
			1) NCBI -- blastpgp and formatdb should 
				   be in this folder 
			2) SIFT_DIR - all SIFT executables should be in
			   SIFT_DIR/bin

			3) BLIMPS_DIR - where the blimps directory is 


II.	DATABASE FORMAT
	This step is optional if you are inputting a protein alignment
	or an NCBI gi id.  If you are submitting either of these, you can
	skip to III.B or III.C.
  
	SIFT searches a database of protein sequences to find homologous 
	sequences.  You will need to download a database of protein sequences
	and format it properly so that SIFT subroutines can read it.

	A.	Database from NCBI: 
		Make sure that the gi number is listed first, 
		i.e. gi|1234567|.... 

	*******		OR 	********

	B.	Database from EMBL:
		ftp://ftp.ebi.ac.uk/pub/databases/fastafiles/uniprot/	

		To convert the sequence names in this database to the format
		SIFT can parse, create a sed script titled "sed.in".

		Paste the 4 lines below into sed.in:

		/_/{
			s/ (/|/
			s/) / /
		}

		Then run your database through it.  For example,

		gzcat uniprotkb_trembl.gz | sed -f sed.in > swiss.uni
		$NCBI/formatdb -i swiss.uni -t 'Uniprot-TrEMBL <version>'

		The names will be changed to the format needed for proper parsing.
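
		For readers who want to check what the sed script does to a
		header line, the following is a minimal Python sketch of the same
		two substitutions.  The function name fix_header and the sample
		header are hypothetical and only for illustration; SIFT itself
		uses the sed script above.

```python
def fix_header(line: str) -> str:
    """Mimic sed.in: on lines containing '_', replace the first ' (' with '|'
    and then the first ') ' with a single space. Other lines are unchanged."""
    if "_" not in line:
        return line
    line = line.replace(" (", "|", 1)
    return line.replace(") ", " ", 1)


if __name__ == "__main__":
    # Hypothetical header, made up for illustration only.
    print(fix_header(">ABCD_HUMAN (P00000) Example protein"))
```

		Lines without an underscore, such as sequence lines, pass through
		unchanged, as they do with the sed address /_/.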

	If you have your own protein sequence database and SIFT is not properly
	recognizing the names, go to src/Alignment.c and modify "fix_names",
	then recompile ALL programs.
 
III.	RUNNING SIFT

	A.	Input: Protein sequence (SIFT chooses homologues).
		Takes 3 inputs:
		1) Protein sequence in fasta format.
		2) Protein database to search.  These sequences are 
		   assumed to be functional.
		3) File of substitutions to be predicted on (optional).  
		   See test/lacI.subst for an example of the format. 
		   Alternatively, you can print scores for the entire
		   protein sequence. 

		Results will be stored in tmp/<seq_file>.SIFTprediction.

		COMMANDLINE FOR A LIST OF SUBSTITUTIONS:
		If you are in SIFT_DIR, the commandline is: 

		bin/SIFT_for_submitting*.csh <seq file> <protein_database> <file of substitutions> <median conservation- 2.75 recommended>

		EXAMPLE:
		If you have a list of substitutions, type the following:
		bin/SIFT.csh test/lacI.fasta <protein_database> test/lacI.subst 2.75

		Results will appear in lacI.fasta.SIFTprediction and 
		look something like:

		K2S     TOLERATED 0.08 3.47 LOW CONFIDENCE
		P3M     TOLERATED 0.08  3.35 LOW CONFIDENCE
		V15K    INTOLERANT 0.00 2.84

		According to this output, the SIFT score for K2S is 0.08
		and the median information of the sequences that have an 
		amino acid represented at position 2 is 3.47.  If this
		number exceeds 3.25, the substitution is annotated as 
		having LOW CONFIDENCE (which means too few sequences were
		represented at that position).  There are enough sequences for
		confidence in the V15K prediction. 
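
		If you want to post-process the prediction file, the sketch below
		shows one way to read a line of it in Python.  It assumes the
		columns are whitespace-separated and appear in the order shown in
		the sample above (substitution, prediction, SIFT score, median
		information, optional LOW CONFIDENCE flag).  The names Prediction
		and parse_prediction are hypothetical, not part of SIFT.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    substitution: str     # e.g. "K2S"
    call: str             # prediction label as printed by SIFT
    score: float          # SIFT score
    median_info: float    # median information at that position
    low_confidence: bool  # True when the line ends with LOW CONFIDENCE


def parse_prediction(line: str) -> Prediction:
    """Parse one whitespace-separated line of a .SIFTprediction file."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"unexpected prediction line: {line!r}")
    return Prediction(
        substitution=fields[0],
        call=fields[1],
        score=float(fields[2]),
        median_info=float(fields[3]),
        low_confidence=fields[4:6] == ["LOW", "CONFIDENCE"],
    )


if __name__ == "__main__":
    for row in ("K2S     TOLERATED 0.08 3.47 LOW CONFIDENCE",
                "V15K    INTOLERANT 0.00 2.84"):
        print(parse_prediction(row))
```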

		COMMANDLINE TO PRINT ALL SIFT SCORES
		bin/SIFT.csh <seq file> <protein_database> - <median conservation - 2.75 recommended>
		A dash "-" replaces the list of substitutions.
		Results will appear in lacI.fasta.SIFTprediction.  Each row is a position
		in the sequence (row 1 is amino acid position 1, row 2 is amino acid 2) and
		the SIFT scores for each amino acid substitution are printed for each row.


	B.	Input: Your own protein alignment (and the path for the environment variable BLIMPS_DIR, which was set in SIFT.csh).

		COMMANDLINE FORMAT FOR A LIST OF SUBSTITUTIONS:
		If you are in SIFT_DIR, the commandline is:
		env BLIMPS_DIR=<blimps_path> bin/info_on_seqs <protein alignment> <substitution file> <output file>

		EXAMPLE:
		Type in:
		env BLIMPS_DIR=<blimps_path> bin/info_on_seqs test/lacI.alignedfasta test/lacI.subst test/lacI.fasta.SIFTprediction

		The prediction results will appear in 
		test/lacI.fasta.SIFTprediction.  See III.A for a description 
		of the output.

		COMMANDLINE TO PRINT ALL SIFT SCORES:
		env BLIMPS_DIR=<blimps_path> bin/info_on_seqs <protein alignment> - <output file>

		Example:
		Type in:
		env BLIMPS_DIR=<blimps_path> bin/info_on_seqs test/lacI.alignedfasta - test/lacI.fasta.SIFTprediction

		Scores for each position will appear in the file.  See III.A for a
		description of the output.

	C.	Input: BLink gi #.  Only 1 input, the gi #, is required.
		
		csh bin/SIFT_for_submitting_NCBI_gi_id.csh <gi id> <subst_file> BEST

		1) <gi id> : the NCBI protein gi #.  SIFT will retrieve precomputed 
		   BLAST hits from NCBI based on this gi ID. 
		2) File of substitutions to be predicted on (optional).
		   See test/lacI.subst for an example of the format.
		   Alternatively, you can print scores for the entire
		   protein sequence by entering a "-".
		3) Type of hits to retrieve from BLink (optional).  The two options
		   are BEST or ALL.  By default, ALL hits are retrieved.  To get
		   reciprocal best hits, pass in "BEST".
                
		Results will be stored in tmp/<gi #>.SIFTprediction

                COMMANDLINE FOR A LIST OF SUBSTITUTIONS:
                If you are in SIFT_DIR, the commandline is:

                csh bin/SIFT_for_submitting_NCBI_gi_id.csh <gi id> <file of substitutions> <BEST or ALL> 

                EXAMPLE:
                If you have a list of substitutions, type the following:
                csh bin/SIFT_for_submitting_NCBI_gi_id.csh 22209009 test/gi22209009.subst BEST 

                Results will appear in $tmpdir/22209009.SIFTprediction and
                look something like:

		Q10M    TOLERATED       0.12    2.71    22      98
		Q11C    DELETERIOUS     0.04    2.74    23      98

		along with some warnings.

		Read III.A for description of output.

                COMMANDLINE TO PRINT ALL SIFT SCORES
               	csh bin/SIFT_for_submitting_NCBI_gi_id.csh <gi id> - BEST 

		A dash "-" replaces the substitution file, and BEST is optional.
		Results will appear in <gi id>.SIFTprediction.
		Read III.A for description of output.

UPDATE TO SIFT 4.0
------------------

SIFT 4.0.2 includes four new tools to enable exome-wide analysis of single nucleotide variants and indels.

SIFT_exome_nssnvs.pl
SIFT_exome_indels.pl
SNPClassifier
SIFT_intersect_cds.pl

(for SNPClassifier, see the documentation in bin/SNPClassifier/ directory)

0. SIFT_intersect_cds.pl pre-filters variants from the whole genome to coding variants only.  Run this only if you want to ignore noncoding variants and time and space are an issue.  Otherwise, proceed directly to steps 1 & 2, which also provide some annotation for noncoding variants.

0a. Setting up:

The file containing coding exon coordinates should be downloaded from the ftp location
ftp://ftp.jcvi.org/pub/data/sift/Coding_info_36/ens.hum.ncbi36.ver41.cds.merge.gff 
-or-
ftp://ftp.jcvi.org/pub/data/sift/Coding_info_37/ens.hum.ncbi37.ver55.cds.merge.gff

We recommend you put these files in your $SIFT_HOME/coding_info/Homo_sapien_version folder

Also, set the bin path in IntersectLocations.sh to the appropriate SIFT path 
 
0b. Preparing the input

Format Example 1: RESIDUE BASED COORDINATE SYSTEM (comma separated)
3,81780820,-1,T/C
2,43881517,1,A/T,#User Comment
2,43857514,1,T/C
6,88375602,1,G/A,#User Comment
22,29307353,-1,T/A
10,115912482,-1,C/T

Format Example 2: SPACE BASED COORDINATE SYSTEM (comma separated)
3,81780819,81780820,-1,T/C
2,43881516,43881517,1,A/T,#User Comment
2,43857513,43857514,1,T/C
6,88375601,88375602,1,G/A,#User Comment
22,29307352,29307353,-1,T/A
10,115912481,115912482,-1,C/T

An example input file is provided: SIFT_HOME/test/snvs.input
Format Description [comma separated: chromosome,coordinate,orientation,alleles,user comment(optional) ]
Please do not use spaces except in the user comments field.

Coordinate System:
SIFT accepts both residue-based and space-based coordinates for single nucleotide variants.
If there is only one column of coordinates, as shown in Example 1 above, SIFT assumes the coordinate
system is residue-based.  If there are two columns, as shown in Example 2 above, SIFT assumes the
coordinate system is space-based.

The space-based coordinate system counts the spaces before and after bases rather than the bases themselves.
Zero always refers to the space before the first base.
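
The one-column versus two-column rule above can be checked on your own input before submitting it.
The following Python sketch is only an illustration of that rule, not SIFT's own code; the function
name coordinate_system is hypothetical, and it assumes the optional user comment always begins
with ",#" as in the examples.

```python
def coordinate_system(row: str) -> str:
    """Return 'residue' for one coordinate column, 'space' for two."""
    core = row.strip().split(",#", 1)[0]  # drop the optional user comment
    n_fields = len(core.split(","))
    if n_fields == 4:   # chromosome, coordinate, orientation, alleles
        return "residue"
    if n_fields == 5:   # chromosome, begin, end, orientation, alleles
        return "space"
    raise ValueError(f"unexpected number of fields in row: {row!r}")


if __name__ == "__main__":
    print(coordinate_system("2,43881517,1,A/T,#User Comment"))
    print(coordinate_system("3,81780819,81780820,-1,T/C"))
```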

0c. Running the tool

cd to the SIFT_HOME/bin directory and edit SIFT_intersect_cds.pl to change the line
$ENV{'SIFT_HOME'} = '/usr/local/projects/SIFT/sift4.0/';
to
$ENV{'SIFT_HOME'} = '<YOUR SIFT_HOME_PATH>';

The usage of SIFT_intersect_cds.pl is as follows:

usage:
./SIFT_intersect_cds.pl
        -i <List of variants with complete path>
        -c <File with coding coordinates in gff format>
        -o <Optional: output file with complete path - default=/usr/local/projects/SIFT/sift4.0//tmp>

To run the example input provided in the SIFT_HOME/test directory,

./SIFT_intersect_cds.pl -i  ../test/snvs.input  -c $SIFT_HOME/coding_info/Homo_sapien_version/ens.hum.<VERSION>.cds.merge.gff
This program can process millions of variants in < 1 minute.

1. The SIFT_exome_nssnvs.pl script takes as input a list of chromosome coordinates of coding
single nucleotide variants and outputs variant annotation along with SIFT predictions and scores.
This tool requires human variation databases, built using SQLite3, that must be downloaded before
the tool can be used. 

NOTE: This tool is also available on the SIFT website at
http://sift.jcvi.org/www/SIFT_chr_coords_submit.html

1a. Setting up:

Human variation databases should be downloaded from the ftp location 
ftp://ftp.jcvi.org/pub/data/sift/Human_db_36/ , unzipped and placed in 
the directory SIFT_HOME/db/Human_db_version/

Please see ftp://ftp.jcvi.org/pub/data/sift/Human_db_36/README (using your 
browser) for more information about downloading and using the databases in
standalone mode.

1b. Preparing the input

Format Example 1: RESIDUE BASED COORDINATE SYSTEM (comma separated)
3,81780820,-1,T/C
2,43881517,1,A/T,#User Comment
2,43857514,1,T/C
6,88375602,1,G/A,#User Comment
22,29307353,-1,T/A
10,115912482,-1,C/T

Format Example 2: SPACE BASED COORDINATE SYSTEM (comma separated)
3,81780819,81780820,-1,T/C
2,43881516,43881517,1,A/T,#User Comment
2,43857513,43857514,1,T/C
6,88375601,88375602,1,G/A,#User Comment
22,29307352,29307353,-1,T/A
10,115912481,115912482,-1,C/T

An example input file is provided: SIFT_HOME/test/snvs.input
Format Description [comma separated: chromosome,coordinate,orientation,alleles,user comment(optional) ]
Please do not use spaces except in the user comments field.

Coordinate System:
SIFT accepts both residue-based and space-based coordinates for single nucleotide variants.
If there is only one column of coordinates, as shown in Example 1 above, SIFT assumes the coordinate
system is residue-based.  If there are two columns, as shown in Example 2 above, SIFT assumes the
coordinate system is space-based.

The space-based coordinate system counts the spaces before and after bases rather than the bases themselves.
Zero always refers to the space before the first base.

The sequence 'ACGT' has coordinates (0,4) and its subsequence 'CG' has coordinates (1,3) as shown in Example 3 below.
The difference between the start and end coordinates gives the sequence length. Misinterpretation of these
coordinates can easily lead to 'off-by-one' errors. Space-based coordinates become necessary when describing
insertions/deletions and genomic rearrangements.

Example 3:

0 A 1 C 2 G 3 T 4

In a residue-based system, as shown in Example 4 below, each base is assigned a coordinate based on its
absolute position, starting from 1. The sequence 'ACGT' has coordinates (1,4) and its subsequence 'CG' has
coordinates (2,3).

Example 4:
A C G T
1 2 3 4
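
Comparing Examples 3 and 4, a residue-based interval (start,end) corresponds to the space-based
interval (start-1,end).  The Python sketch below illustrates this conversion in both directions;
the function names are hypothetical and are not part of SIFT.

```python
from typing import Tuple


def residue_to_space(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, inclusive residue interval to space-based."""
    if start < 1 or end < start:
        raise ValueError("residue intervals need 1 <= start <= end")
    return (start - 1, end)


def space_to_residue(begin: int, end: int) -> Tuple[int, int]:
    """Convert a space-based interval to residue-based.

    A zero-length interval (begin == end) marks a position between two
    bases and has no residue-based equivalent.
    """
    if begin < 0 or end <= begin:
        raise ValueError("need 0 <= begin < end to name at least one base")
    return (begin + 1, end)


if __name__ == "__main__":
    print(residue_to_space(2, 3))                # 'CG' within 'ACGT'
    print(residue_to_space(81780820, 81780820))  # single base from Example 1
```

For a single nucleotide variant at residue position p, this gives the space-based pair (p-1,p),
which matches the rows of Format Example 1 and Format Example 2.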


Orientation:
Use 1 for positive strand and -1 for negative strand. If orientation is not known, use 1 as default.

Alleles:
Use 'base1/base2' where either base1 or base2 may be the reference allele. SIFT will predict for the non-reference
allele only. If you need a prediction for the reference allele, use base1/base1, where base1 is the reference allele.

1c. Running the tool

cd to the SIFT_HOME/bin directory and edit SIFT_exome_nssnvs.pl to change the line
$ENV{'SIFT_HOME'} = '/usr/local/projects/SIFT/sift4.0.2/';
to
$ENV{'SIFT_HOME'} = '<YOUR SIFT_HOME_PATH>';

The usage of SIFT_exome_nssnvs.pl is as follows:

usage: 
./SIFT_exome_nssnvs.pl 
        -i <Query SNP filename with complete path>
        -d <Variation db directory path>
        -o <Optional: output file with complete path - default=/usr/local/projects/SIFT/sift4.0//tmp>

To run the example input provided in the SIFT_HOME/test directory, 

./SIFT_exome_nssnvs.pl -i ../test/snvs.input -d <SIFT_HOME>/db/Human_db_36/ 
Your input data has been recognized to use SPACE based coordinate system. Your job id is 30072 and is currently running.  Your job has been
partitioned into datasets of 1000 positions and the status of each job can be viewed /usr/local/projects/SIFT/sift4.0//tmp/30072/30072.outpage.txt.
Once the status of a job is 'Complete', you may view the results. A partitioned job with 1000 input rows typically takes 6-7 min to complete.

The output directory is SIFT_HOME/tmp/PID by default; in the above case, PID is 30072.

The status of long running jobs (> 5000 input rows) may be viewed at <OUTPUT_DIR>/PID.outpage.txt

2. The SIFT_exome_indels.pl script takes as input a list of chromosome coordinates of coding
insertion/deletion variants and outputs variant annotation. SIFT scores and predictions are not provided
at this stage. This tool requires human coding information files that must be downloaded before
the tool can be used.

NOTE: This tool is also available on the SIFT website at 
http://sift.jcvi.org/www/SIFT_chr_coords_indels_submit.html

2a. Setting up: 

Human coding information files should be downloaded from the ftp location
ftp://ftp.jcvi.org/pub/data/sift/Coding_info_36/ , unzipped and placed in
the directory SIFT_HOME/coding_info/Homo_sapien_version/

Please see ftp://ftp.jcvi.org/pub/data/sift/Coding_info_36/README (using your
browser) for more information about downloading and using the files with this
tool or with SNPClassifier.

Also, set the bin path in IntersectLocations.sh to the appropriate SIFT path

2b. Preparing the input

Format Example: SPACE BASED COORDINATE SYSTEM (comma separated)
10,102760304,102760304,1,GCGGCT,#User comment 1
10,50205013,50205013,1,ACACACACACAC
5,179134934,179134935,1,/,#User comment 2
1,153108866,153108866,1,CTGCTGCTGCTG
11,6368547,6368547,1,GCTGGCGCTGGC
11,65081932,65081932,1,AGCAGC
12,110521161,110521164,1,/
12,116990733,116990736,1,/
12,123453048,123453048,1,CTG
12,131113090,131113090,1,GCA
12,1932613,1932613,1,CTG


Format Description [comma separated: chromosome,begin coordinate,end coordinate,orientation,alleles,user comment(optional) ]
Please do not use spaces except in the user comments field.

Coordinate System:
SIFT accepts only space-based coordinates for insertion / deletion variants.
The space-based coordinate system counts the spaces before and after bases rather than the bases themselves.
Zero always refers to the space before the first base.

The sequence 'ACGT' has coordinates (0,4) and its subsequence 'CG' has coordinates (1,3) as shown in Example 1 below.
The difference between the start and end coordinates gives the sequence length. Misinterpretation of these
coordinates can easily lead to 'off-by-one' errors. Space-based coordinates become necessary when describing
insertions/deletions and genomic rearrangements.

Example 1:

0 A 1 C 2 G 3 T 4

Orientation:
Use 1 for positive strand and -1 for negative strand. If orientation is not known, use 1 as default.

Alleles:
For an insertion, the begin and end coordinates should be the same and the allele should be a string of inserted nucleotides
in one of the following formats:
1. ----/ATGC
2. -/ATGC
3. ATGC

For a deletion, the difference between the begin and end coordinates should equal the length of the deleted string. The allele
can either be left blank or be specified in one of the following formats:
1. ATGC/----
2. ATGC/-
3. /
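
The allele rules above can be used to sanity-check an input row before running the tool.  The
following Python sketch is one possible check, written only from the rules in this section; the
function name classify_indel is hypothetical, and SIFT's own validation may differ.

```python
from typing import Tuple


def classify_indel(row: str) -> Tuple[str, int]:
    """Return ('insertion' | 'deletion', length in nucleotides) for one row."""
    core = row.strip().split(",#", 1)[0]  # drop the optional user comment
    fields = core.split(",")
    if len(fields) < 4:
        raise ValueError(f"too few fields: {row!r}")
    begin, end = int(fields[1]), int(fields[2])
    allele = fields[4] if len(fields) > 4 else ""  # a deletion allele may be blank

    if begin == end:
        # Insertion: '----/ATGC', '-/ATGC' or 'ATGC'
        inserted = allele.split("/")[-1]
        if not inserted or set(inserted.upper()) - set("ACGT"):
            raise ValueError(f"insertion needs inserted nucleotides: {row!r}")
        return ("insertion", len(inserted))

    if end < begin:
        raise ValueError(f"end coordinate is before begin: {row!r}")
    # Deletion: 'ATGC/----', 'ATGC/-', '/' or blank
    deleted = allele.split("/")[0].strip("-")
    if deleted and len(deleted) != end - begin:
        raise ValueError(f"deleted string length != end - begin: {row!r}")
    return ("deletion", end - begin)


if __name__ == "__main__":
    print(classify_indel("10,102760304,102760304,1,GCGGCT,#User comment 1"))
    print(classify_indel("12,110521161,110521164,1,/"))
```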

2c. Running the tool

cd to the SIFT_HOME/bin directory and edit SIFT_exome_indels.pl to change the line
$ENV{'SIFT_HOME'} = '/usr/local/projects/SIFT/sift4.0/';
to
$ENV{'SIFT_HOME'} = '<YOUR SIFT_HOME_PATH>';

The usage of SIFT_exome_indels.pl is as follows:
usage: 
./SIFT_exome_indels.pl 
        -i <Query indels filename with complete path>
        -c <coding info directory path>
        -d <Variation db directory path>
        -o <Optional: output file with complete path - default=SIFT_HOME/tmp>

        All values should be in local 0 space based coordinates.

To run the example input provided in the SIFT_HOME/test directory,
./SIFT_exome_indels.pl -i ../test/indels.input -c <SIFT_HOME>/coding_info/Homo_sapien_36/ -d <SIFT_HOME>/db/Human_db_36/ 

The default output directory is SIFT_HOME/tmp/PID.

2d. Description of output 

(This can also be viewed on the SIFT website at http://sift.jcvi.org/www/chr_coords_example_indels.html)

Amino Acid Position Change

This column contains the change coordinates within the original protein sequence and the modified 
protein sequence. For example, the insertion of GCGGCT at location 102760304 of chromosome 10 of 
Homo sapiens (represented by input row: 10,102760304,102760304,1,GCGGCT) inserts two additional 
amino acids, Arginine 'R' and Serine 'S', at positions 145 to 147 (space-based coordinates) in the 
modified protein sequence. 

 
>ENST00000238965; MISMATCH = 145-145
GPQEQGSPASCFETSPAGHATQASPYHPRACRGGFYLLPVNGFPEEEDNGELRERLGALK
VSPSASAPRHPHKGIPPLQDVPVDAFTPLRIACTPPPQLPPVAPRPLRPNWLLTEPLSRE
HPPQSQIRGRAQSRSRSRSRSRSRSSRGQGKSPGRRSPSPVPTPAPSMTNGRYHKPRKAR
PPLPRPLDGEAAKVGAKQGPSESGTEGTAKEAAMKNPSGELKTVTLSKMKQSLGISISGG
IESKVQPMVKIEKIFPGGAAFLSGALQAGFELVAVDGENLEQVTHQRAVDTIRRAYRNKA
REPMELVVRVPGPSPRPSPSDSSALTDGGLPADHLPAHQPLDAAPVPAHWLPEPPTNPQT
PPTDARLLQPTPSPAPSPALQTPDSKPAPSPRIP
 
>ENST00000238965; MISMATCH = 145-147
GPQEQGSPASCFETSPAGHATQASPYHPRACRGGFYLLPVNGFPEEEDNGELRERLGALK
VSPSASAPRHPHKGIPPLQDVPVDAFTPLRIACTPPPQLPPVAPRPLRPNWLLTEPLSRE
HPPQSQIRGRAQSRSRSRSRSRSRSrsSRGQGKSPGRRSPSPVPTPAPSMTNGRYHKPRK
ARPPLPRPLDGEAAKVGAKQGPSESGTEGTAKEAAMKNPSGELKTVTLSKMKQSLGISIS
GGIESKVQPMVKIEKIFPGGAAFLSGALQAGFELVAVDGENLEQVTHQRAVDTIRRAYRN
KAREPMELVVRVPGPSPRPSPSDSSALTDGGLPADHLPAHQPLDAAPVPAHWLPEPPTNP
QTPPTDARLLQPTPSPAPSPALQTPDSKPAPSPRIP


Indel location

This percentage indicates the approximate location of the indel in the protein. For example, 
a value less than 50% means that the indel is located in the first half of the protein and is 
close to the amino terminus, whereas a number greater than 50% means that the indel is closer 
to the carboxy terminus.

Transcript Visualization

<---{}--{}[]--[*.]--[]--[]--[]--[]--[]--[]--[]{}---|

The above example visualization mimics the structure of the transcript containing the indel.

<--- indicates the 3' end
---| indicates the 5' end
{}   indicates a UTR 
[]   indicates a coding exon
--   indicates an intron
.    indicates the start of insertion or deletion
*    indicates the end of deletion

If the 3' end of the transcript appears to the left of the 5' end, as in this case, then the 
transcript is transcribed from the negative strand. This transcript has two 3' UTRs, one 5' UTR, 
nine coding exons and nine introns. The indel both starts and ends in the 8th coding exon.


Nucleotide change

The input allele (insertion or deletion) and +/- 5 base pairs are shown. For example,
the user input for insertion variant "10,102760304,102760304,1,GCGGCT" will populate 
this column with the following information
cggct-GCGGCT-acggc

whereas a user input for deletion variant "12,110521161,110521164,1,/" will populate 
this column with the following information
TGCTG-ctg-TTGCT

For insertions, the inserted bases are displayed in uppercase and the flanking bases are 
displayed in lowercase. For deletions, the deleted bases are displayed in lowercase whereas 
the flanking bases are displayed in uppercase.
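
The upper/lowercase convention can be summarized in a few lines of Python.  This is a sketch of the
display rule only; the function name and its arguments are hypothetical, and it does not look up
the flanking bases, which SIFT takes from the reference sequence.

```python
def format_nucleotide_change(left: str, allele: str, right: str, kind: str) -> str:
    """Render flank-allele-flank using the case convention described above."""
    if kind == "insertion":
        # Inserted bases in uppercase, flanking bases in lowercase.
        return f"{left.lower()}-{allele.upper()}-{right.lower()}"
    if kind == "deletion":
        # Deleted bases in lowercase, flanking bases in uppercase.
        return f"{left.upper()}-{allele.lower()}-{right.upper()}"
    raise ValueError("kind must be 'insertion' or 'deletion'")


if __name__ == "__main__":
    print(format_nucleotide_change("CGGCT", "GCGGCT", "ACGGC", "insertion"))
    print(format_nucleotide_change("TGCTG", "CTG", "TTGCT", "deletion"))
```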


Amino acid change

This column displays the amino acid change caused by the indel. For example,
QQTT->QQqTT indicates the addition of the amino acid Glutamine ('Q') in the modified protein sequence,
whereas EEeDA->EEDA indicates the deletion of the amino acid Glutamic acid ('E') in the 
modified protein sequence.


Protein sequence change

This column links to the original and modified protein sequence files, with regions of mismatch (caused by the indel) 
colored in red. For example, an insertion represented by the user input 
"1,153108866,153108866,1,CTGCTGCTGCTG" 
causes an expansion in a polyglutamine tract, as shown in the following fasta format sequences. 
The fasta headers contain the Ensembl transcript ID along with the coordinates of change.

 
>ENST00000271915; MISMATCH = 80-80
MDTSGHFHDSGVGDLDEDPKCPCPSSGDEQQQQQQQQQQQQPPPPAPPAAPQQPLGPSLQ
PQPPQLQQQQQQQQQQQQQQPPHPLSQLAQLQSQPVHPGLLHSSPTAFRAPPSSNSTAIL
HPSSRQGSQLNLNDHLLGHSPSSTATSGPGGGSRHRQASPLVHRRDSNPFTEIAMSSCKY
SGGVMKPLSRLSASRRNLIEAETEGQPLQLFSPSNPPEIVISSREDNHAHQTLLHHPNAT
HNHQHAGTTASSTTFPKANKRKNQNIGYKLGHRRALFEKRKRLSDYALIFGMFGIVVMVI
ETELSWGLYSKDSMFSLALKCLISLSTIILLGLIIAYHTREVQLFVIDNGADDWRIAMTY
ERILYISLEMLVCAIHPIPGEYKFFWTARLAFSYTPSRAEADVDIILSIPMFLRLYLIAR
VMLLHSKLFTDASSRSIGALNKINFNTRFVMKTLMTICPGTVLLVFSISLWIIAAWTVRV
CERYHDQQDVTSNFLGAMWLISITFLSIGYGDMVPHTYCGKGVCLLTGIMGAGCTALVVA
VVARKLELTKAEKHVHNFMMDTQLTKRIKNAAANVLRETWLIYKHTKLLKKIDHAKVRKH
QRKFLQAIHQLRSVKMEQRKLSDQANTLVDLSKMQNVMYDLITELNDRSEDLEKQIGSLE
SKLEHLTASFNSLPLLIADTLRQQQQQLLSAIIEARGVSVAVGTTHTPISDSPIGVSSTS
FPTPYTSSSSC
 
>ENST00000271915; MISMATCH = 80-84
MDTSGHFHDSGVGDLDEDPKCPCPSSGDEQQQQQQQQQQQQPPPPAPPAAPQQPLGPSLQ
PQPPQLQQQQQQQQQQQQQQqqqqPPHPLSQLAQLQSQPVHPGLLHSSPTAFRAPPSSNS
TAILHPSSRQGSQLNLNDHLLGHSPSSTATSGPGGGSRHRQASPLVHRRDSNPFTEIAMS
SCKYSGGVMKPLSRLSASRRNLIEAETEGQPLQLFSPSNPPEIVISSREDNHAHQTLLHH
PNATHNHQHAGTTASSTTFPKANKRKNQNIGYKLGHRRALFEKRKRLSDYALIFGMFGIV
VMVIETELSWGLYSKDSMFSLALKCLISLSTIILLGLIIAYHTREVQLFVIDNGADDWRI
AMTYERILYISLEMLVCAIHPIPGEYKFFWTARLAFSYTPSRAEADVDIILSIPMFLRLY
LIARVMLLHSKLFTDASSRSIGALNKINFNTRFVMKTLMTICPGTVLLVFSISLWIIAAW
TVRVCERYHDQQDVTSNFLGAMWLISITFLSIGYGDMVPHTYCGKGVCLLTGIMGAGCTA
LVVAVVARKLELTKAEKHVHNFMMDTQLTKRIKNAAANVLRETWLIYKHTKLLKKIDHAK
VRKHQRKFLQAIHQLRSVKMEQRKLSDQANTLVDLSKMQNVMYDLITELNDRSEDLEKQI
GSLESKLEHLTASFNSLPLLIADTLRQQQQQLLSAIIEARGVSVAVGTTHTPISDSPIGV
SSTSFPTPYTSSSSC


Causes Nonsense Mediated Decay

Nonsense mediated decay (NMD) is a cellular mechanism of mRNA surveillance to detect 
nonsense mutations and prevent the expression of truncated or erroneous proteins.
This column indicates whether the input indel is likely to cause NMD. If NMD occurs, 
then the indel is equivalent to a gene deletion because the mRNA is never translated.

There is no NMD when:
1) the resulting premature termination codon is in the last exon
-or-
2) the resulting premature termination codon is in the last 50 nucleotides in the second to last exon
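
The two conditions above can be written as a small predicate.  The Python sketch below is an
illustration of the rule as stated here, not SIFT's implementation; the function and parameter
names are hypothetical.  It assumes exons are numbered from 1 at the 5' end and that the position
of the premature termination codon is counted from the 3' end of its exon (1 = last nucleotide).

```python
def escapes_nmd(ptc_exon: int, n_exons: int, nt_from_exon_end: int) -> bool:
    """Return True when the rule above says there is no NMD.

    ptc_exon         -- 1-based number of the exon holding the premature stop
    n_exons          -- total number of exons in the transcript
    nt_from_exon_end -- position of the stop counted from the exon's 3' end
    """
    if not 1 <= ptc_exon <= n_exons:
        raise ValueError("ptc_exon must be between 1 and n_exons")
    if ptc_exon == n_exons:
        return True  # condition 1: stop codon is in the last exon
    if ptc_exon == n_exons - 1 and nt_from_exon_end <= 50:
        return True  # condition 2: last 50 nt of the second to last exon
    return False


if __name__ == "__main__":
    print(escapes_nmd(ptc_exon=9, n_exons=9, nt_from_exon_end=120))
    print(escapes_nmd(ptc_exon=3, n_exons=9, nt_from_exon_end=10))
```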

Repeat detected

This column gets populated if the input insertion/deletion is found to expand or contract a 
coding repeat region. For example, an input row '1,153108866,153108866,1,CTGCTGCTGCTG' causes 
an insertion resulting in the expansion of a poly-glutamine repeat. A poly-glutamine repeat of 
length 14 that expands to length 18 is illustrated in this column by 'PQL(q)14P-->PQL(q)18P'. 
The repeat amino acid(s) are shown in parentheses followed by the repeat number and bounded 
by flanking amino acids.
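
If you need the repeat lengths as numbers, the notation can be parsed with a regular expression.
The Python sketch below assumes the column always has the form shown in the example above, with
one parenthesized repeat unit on each side of '-->'; the name parse_repeat_change is hypothetical.

```python
import re
from typing import Tuple

# A parenthesized repeat unit followed by its repeat number, e.g. "(q)14".
REPEAT = re.compile(r"\((?P<unit>[A-Za-z]+)\)(?P<count>\d+)")


def parse_repeat_change(text: str) -> Tuple[str, int, int]:
    """Return (repeat unit, length before, length after)."""
    before, sep, after = text.partition("-->")
    m_before, m_after = REPEAT.search(before), REPEAT.search(after)
    if not sep or m_before is None or m_after is None:
        raise ValueError(f"unrecognized repeat notation: {text!r}")
    return (m_before.group("unit"),
            int(m_before.group("count")),
            int(m_after.group("count")))


if __name__ == "__main__":
    unit, old, new = parse_repeat_change("PQL(q)14P-->PQL(q)18P")
    print(unit, old, new, "expansion" if new > old else "contraction")
```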

Warning: NCBI reference miscall

If you receive a reference miscall warning in the coordinates column (first column) of the output 
table, this means that your input coordinates overlap or contain a location that is not a true indel 
but is likely to be an error in the NCBI human genome reference sequence.  
