Mapping mutations in plant genomes with the user-friendly web application CandiSNP

Background: Analysis of mutants isolated from forward-genetic screens has revealed key components of several plant signalling pathways. Mapping mutations by position, either using classical methods or whole genome high-throughput sequencing (HTS), largely relies on the analysis of genome-wide polymorphisms in F2 recombinant populations. Combining bulk segregant analysis with HTS has accelerated the identification of causative mutations and has been widely adopted in many research programmes. A major advantage of HTS is the ability to perform bulk segregant analysis after back-crossing to the parental line rather than out-crossing to a polymorphic ecotype, which reduces genetic complexity and avoids issues with phenotype penetrance in different ecotypes. Plotting the positions of homozygous polymorphisms in a mutant genome identifies areas of low recombination and is an effective way to detect molecular linkage to a phenotype of interest.

Results: We describe the use of single nucleotide polymorphism (SNP) density plots as a mapping strategy to identify and refine chromosomal positions of causative mutations from screened plant populations. We developed a web application called CandiSNP that generates density plots from user-provided SNP data obtained from HTS. Candidate causative mutations, defined as SNPs causing non-synonymous changes in annotated coding regions, are highlighted on the plots and listed in a table. We use data generated from a recent mutant screen in the model plant Arabidopsis thaliana as proof-of-concept for the validity of our tool.

Conclusions: CandiSNP is a user-friendly application that will aid in novel discoveries from forward-genetic mutant screens. It is particularly useful for analysing HTS data from bulked back-crossed mutants, which contain fewer polymorphisms than data generated from out-crosses. The web application is freely available online at http://candisnp.tsl.ac.uk.
Electronic supplementary material The online version of this article (doi:10.1186/s13007-014-0041-7) contains supplementary material, which is available to authorized users.

The number of elements in this set is referred to as $|S_{tot}|$. The number of SNPs in the set (i.e. in a mob mutant) that can be called in an NGS experiment is $C$: the number of actual SNPs multiplied by the probability of calling a SNP (i.e. the accuracy of the NGS experiment).

$$C = |S_{tot}| \times p(\text{calling a SNP in an NGS experiment}) \qquad (2)$$

For completeness, the number of SNPs missed is the remainder, $|S_{tot}| - C$.

What power is gained (or lost) by having more mutants?
The screen in our hands was empowered somewhat by having more than one mutant, which allowed us to delete SNPs that were called in more than one experiment and are therefore not unique and, presumably, not those introduced by the mutagenesis. SNP analysis pipelines are not absolutely accurate, and not all SNPs present in the genomes will be called in a given SNP analysis. Omissions and false SNP calls can occur due to issues with sequence coverage, errors in the sequence reads, errors by alignment programs or other bioinformatics issues. If we consider a perfect experiment in which we don't miss any SNPs, then the increase in power from having more mutants is the proportion by which we can reduce the non-mutant-related SNPs in each. This is just the overlap between the two sets of SNPs, labelled earlier as $S_{ref}$ and $S_{parent}$, and is on the order of 1700 [1,2] ($S_{ref} + S_{parent}$, see Additional File 1). As the Arabidopsis genome is 130 million nucleotides long, the chance of two independent SNP lists overlapping is clearly very small, and the worst that could happen is that you get 1700 extra SNPs that aren't related to ones in other mutants. Of course, these mutants are very closely related, so they are not independent, and the other extreme case is that all the SNPs would be shared. Two important factors are the degree to which the mutants are related and the ability to call the SNPs correctly; the latter quantity was noted earlier as $C$. Let's consider the extreme case where the $S_{ref}$ SNPs are completely shared between the two mutants.
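Before turning to that extreme case, the independence argument can be checked with a rough calculation. The sketch below is only illustrative: it assumes SNP positions are spread uniformly and independently across the genome (a simplification not made in the text), and the function name is hypothetical.

```python
# Rough check of how rarely two independent SNP lists would coincide.
# ASSUMPTION: SNP positions are uniform and independent across the genome.
GENOME_SIZE = 130_000_000  # Arabidopsis genome length quoted in the text
SNPS_PER_LIST = 1_700      # order-of-magnitude SNP count quoted in the text


def expected_shared_positions(n_a: int, n_b: int, genome_size: int) -> float:
    """Expected number of positions found in both lists under independence.

    Each of the n_a positions in the first list falls on one of the n_b
    positions of the second list with probability n_b / genome_size.
    """
    return n_a * n_b / genome_size


expected = expected_shared_positions(SNPS_PER_LIST, SNPS_PER_LIST, GENOME_SIZE)
print(f"Expected shared positions: {expected:.3f}")
```

Under these assumptions the expected number of coinciding positions is well below one, which is why substantial overlap between two lists points to shared ancestry rather than chance.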
The proportion of SNPs missed the first time round, $m$, is simply 1 minus the probability that we pick up a SNP:

$$m = 1 - \mathrm{Prob}(\text{calling a SNP in an NGS experiment}) \qquad (4)$$

In the second round, with another mutant, the new proportion missed, $m_2$, is the proportion we missed in the first round, $m$, minus the proportion that we pick up this time. That is, the proportion missed overall is the proportion missed in the first mutant minus the proportion picked up in the second. So if we missed 0.05 the first time, and call SNPs with 0.95 accuracy, then the maximum proportion that can be picked up is $0.05 \times 0.95$. Putting these together:

$$m_2 = m - m \times \mathrm{Prob}(\text{calling a SNP in an NGS experiment})$$

For further mutants it's just repetition of the same pattern. So for a third mutant the proportion of SNPs missed, $m_3$, is

$$m_3 = m_2 - m_2 \times \mathrm{Prob}(\text{calling a SNP in an NGS experiment})$$

So now we see a pattern emerging for the expression given certain numbers of mutants. To check a little further, it makes sense to replace that long term with a short alternative, $p$, and to introduce $q$, which is $1 - p$:

$$p = \mathrm{Prob}(\text{calling a SNP in an NGS experiment}), \qquad q = 1 - p$$

We then redo the whole thing with $p$. Here we rename $m$ to $m_1$, so that it still means literally the proportion missed with one mutant. With a little algebra, we can simplify each term by taking out the factor $q$:

$$m_1 = q$$

$$m_2 = q - pq = q(1 - p)$$

$$m_3 = q(1 - p) - pq(1 - p) = q(1 - p)^2$$

So the pattern again is that each extra mutant subtracts a further fraction $p$ of what was left, which contributes another factor of $(1 - p)$. For any number $a$ of mutants, the proportion missed, $m_a$, is therefore

$$m_a = q(1 - p)^{a - 1} = q^a$$

Note that this is for the extreme case, where the SNPs not due to the mutagenesis are completely shared. We can now look at how not sharing complete sets of non-mutagenesis-induced SNPs affects this proportion, a situation like that when we sequence bulks, for example.
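The recursion described in this section (each additional mutant picks up a fraction of what was still missed) can be sketched in a few lines. This is a minimal illustration for the complete-overlap case only; the function name `proportion_missed` is hypothetical, and the comparison with `(1 - p) ** a` is simply what repeated application of the same step yields.

```python
def proportion_missed(p: float, n_mutants: int) -> float:
    """Proportion of shared SNPs still missed after n_mutants experiments.

    p is the probability of calling a SNP in one NGS experiment. Each extra
    mutant recovers a fraction p of whatever was missed so far.
    """
    if n_mutants < 1:
        raise ValueError("n_mutants must be at least 1")
    missed = 1.0 - p  # proportion missed with the first mutant
    for _ in range(n_mutants - 1):
        missed -= missed * p  # subtract what the next mutant picks up
    return missed


# Worked example from the text: miss 0.05 first time, call with 0.95 accuracy.
for a in range(1, 4):
    iterative = proportion_missed(0.95, a)
    closed_form = (1.0 - 0.95) ** a  # same recursion written as a power
    print(a, iterative, closed_form)
```

With two mutants the example gives 0.05 - 0.05 × 0.95 = 0.0025, so most of the benefit arrives with the first extra mutant.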

What if the SNP sets are not completely shared?
One assumption of the model so far is that the natural SNPs are completely shared because the plants are closely related. It's worth noting that when the SNPs don't overlap we can't remove them; that is, they're disjoint sets. We can model the unique parts of each set by simply adding back in the proportion that can't be reached. Writing $o$ for the non-overlapping proportion (so a number between 0 and 1), we get a new, but very similar, model.
We could be more clever with $o$, for example making it a function rather than a constant, perhaps the result of a sum calculating the probability of overlap given the numbers, but for these purposes the proportion should suffice.
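One way to read "adding back in" the non-overlapping proportion is sketched below. This is an assumption made for illustration, not the equation from the text: the unreachable proportion o is treated as a constant added to the shared-SNP proportion still missed, and the shared part is compounded as in the complete-overlap case (each extra mutant recovers a fraction p of what was still missed). The function name is hypothetical.

```python
def proportion_remaining(p: float, n_mutants: int, o: float) -> float:
    """Proportion of non-mutagenesis SNPs left after n_mutants experiments.

    ASSUMPTION: o (the non-overlapping proportion, between 0 and 1) is
    simply added to the shared-SNP proportion still missed; the source's
    own formula may differ.
    """
    if not 0.0 <= o <= 1.0:
        raise ValueError("o must lie between 0 and 1")
    if n_mutants < 1:
        raise ValueError("n_mutants must be at least 1")
    q = 1.0 - p  # proportion missed in a single experiment
    return q ** n_mutants + o


# With o > 0 the curve flattens at o instead of approaching zero.
for a in range(1, 6):
    print(a, round(proportion_remaining(0.9, a, 0.1), 5))
```

Under this reading, extra mutants can only drive the remaining proportion down towards o, never below it.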
So the model shows that, with a pipeline that has a non-negligible chance of missing SNPs, adding extra mutants helps for maybe one or two more, but doesn't really make a difference after that.
Visualising this relationship
Naturally, one would want to see the relationship graphed. There are just two variables: the number of mutants and the proportion of SNPs not called (determined by the accuracy of the SNP calling pipeline, which is not the same as the number of SNPs called that are accurate). Let's first look at a range of figures for the accuracy of the SNP calling pipeline, from 0.99 to 0.5, for 1 to 5 mutants with complete overlap in the SNPs (Figure 1). The basic pattern is that adding extra mutants (more than 3, say) doesn't improve things very much at all (except for p = 0.99, where it makes hardly any difference because we get nearly all the SNPs the first time).
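The same relationship can be tabulated rather than plotted. The sketch below assumes complete overlap and uses the compounding rule described earlier (each extra mutant recovers a fraction p of what was still missed); the particular accuracy values on the grid are illustrative choices within the 0.99 to 0.5 range, and the function names are hypothetical.

```python
# Illustrative grid of pipeline accuracies within the range used in the text.
ACCURACIES = [0.99, 0.95, 0.9, 0.8, 0.7, 0.6, 0.5]
MUTANT_COUNTS = range(1, 6)  # 1 to 5 mutants


def missed(p: float, a: int) -> float:
    """Proportion of shared SNPs still missed after a mutants (complete overlap)."""
    return (1.0 - p) ** a


def build_table() -> dict:
    """Map each accuracy to the list of proportions missed for 1..5 mutants."""
    return {p: [missed(p, a) for a in MUTANT_COUNTS] for p in ACCURACIES}


table = build_table()
print("p     " + "  ".join(f"a={a:<7}" for a in MUTANT_COUNTS))
for p, row in table.items():
    print(f"{p:<5} " + "  ".join(f"{value:<9.6f}" for value in row))
```

Reading along any row shows the steep drop from one to two mutants and the much smaller changes afterwards.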

What happens in the real data?
The main question after that exercise is what dynamic the real data show. Because we have multiple mutants, we can check. To do this we took both orderings of the two mutants and, for each, worked out the reduction in the number of SNPs obtained by deleting the SNPs that overlap with the other mutant. These are plotted in Figure 2. The figure shows the number of SNPs rather than the proportion used in the model, but the dynamic is very similar. The increase in power is marked even with just one extra mutant. The number of SNPs that differ between a parent line and a mutant would simply be the number of SNPs introduced by the mutagenesis plus the natural variation between the plants themselves. It is therefore the same as having a single mutant.

We have, then, a basic model for the relationship and a tool for examining how many extra mutants a researcher would need to maximise their chances of getting down to the truly unique SNPs in each mutant line.
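The filtering step used on the real data (deleting from each mutant the SNPs also called in another mutant) amounts to a set difference. The following is a minimal sketch, not the pipeline actually used: the SNP representation, the function name and the toy data are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical SNP representation: (chromosome, position, alternate base).
Snp = Tuple[str, int, str]


def unique_snps(calls: Dict[str, Set[Snp]]) -> Dict[str, Set[Snp]]:
    """For each mutant, keep only SNPs not called in any other mutant."""
    result = {}
    for name, snps in calls.items():
        # Union of the SNP calls from every other mutant.
        others = set().union(
            *(other_snps for other, other_snps in calls.items() if other != name)
        )
        result[name] = snps - others
    return result


# Toy data, invented purely for illustration.
calls = {
    "mutant_a": {("Chr1", 100, "T"), ("Chr2", 2500, "A"), ("Chr3", 42, "G")},
    "mutant_b": {("Chr1", 100, "T"), ("Chr4", 900, "C")},
}

filtered = unique_snps(calls)
for name in calls:
    print(name, len(calls[name]), "->", len(filtered[name]))
```

Matching on the alternate base as well as the position is a design choice here; matching on position alone would remove more SNPs and may suit noisier calls.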