ChIP-seq Analysis in R (CSAR): An R package for the statistical detection of protein-bound genomic regions
© Muiño et al; licensee BioMed Central Ltd. 2011
Received: 16 March 2011
Accepted: 9 May 2011
Published: 9 May 2011
In vivo detection of protein-bound genomic regions can be achieved by combining chromatin-immunoprecipitation with next-generation sequencing technology (ChIP-seq). The large amount of sequence data produced by this method needs to be analyzed in a statistically proper and computationally efficient manner. The generation of high copy numbers of DNA fragments as an artifact of the PCR step in ChIP-seq is an important source of bias of this methodology.
We present here an R package for the statistical analysis of ChIP-seq experiments. Taking the average size of DNA fragments subjected to sequencing into account, the software calculates single-nucleotide read-enrichment values. After normalization, sample and control are compared using a test based on the ratio test or the Poisson distribution. Test statistic thresholds to control the false discovery rate are obtained through random permutations. Computational efficiency is achieved by implementing the most time-consuming functions in C++ and integrating these in the R package. An analysis of simulated and experimental ChIP-seq data is presented to demonstrate the robustness of our method against PCR-artefacts and its adequate control of the error rate.
The software ChIP-seq Analysis in R (CSAR) enables fast and accurate detection of protein-bound genomic regions through the analysis of ChIP-seq experiments. Compared to existing methods, we found that our package shows greater robustness against PCR-artefacts and better control of the error rate.
Genome-wide identification of in vivo protein-bound genomic regions is essential for a full understanding of transcriptional regulation. DNA fragments that are bound by proteins in vivo can be isolated by chromatin-immunoprecipitation (ChIP) and subsequently identified using microarrays (ChIP-chip) or high-throughput sequencing technologies (ChIP-seq). Recent studies [1, 2] indicate that the ChIP-seq approach provides higher resolution and statistical power than ChIP-chip. To date, only two methods have been described for the analysis of ChIP-seq experiments in plants, i.e.  and the method developed by our group [2, 4].
The common approach to analyze the millions of short sequence reads obtained in a typical ChIP-seq experiment is to map them to a reference genome using one of several mapping tools available, for example SOAPv2, Bowtie, or BWA [5–7]. Reads that map to multiple locations in the genome, so called 'multireads' , are often discarded to avoid the ambiguity of their genomic origin. To account for varying sequencing depths among the different samples in an experiment, current methods typically standardize the number of mapped reads across all samples by a scaling factor. However, it is becoming evident that more sophisticated normalization procedures are needed, since differences in coverage distribution among samples not only depend on the sequencing depth, but also on other properties of the sample , including methodological differences in library preparation, as well as biological differences in the chromatin state of the samples. We are aware of only two published ChIP-seq analysis methods that normalize the data to obtain the same coverage distribution across samples. The PeakSeq method  applies a scaling factor that is obtained from the linear regression between IP and control sample coverages, while in  a quantile normalization method is proposed. Here we describe the implementation of the approach introduced by our group , in which the statistical method of moments is used for the normalization process.
Subsequent to normalization, enrichment of genomic regions is commonly evaluated with a test statistic based on the Poissson or Binomial distribution. To control the false discovery rate (FDR) of such a test, it is necessary to obtain the distribution of the test statistics under the null hypothesis. Some methods, e.g. CisGenome , assume this distribution as known a priori, given the statistical properties of the test. However, this assumption strongly depends on how well the distribution used to construct the test statistics (e.g. Poisson distribution) can represent the real data. Another strategy is to try to empirically estimate the distribution of the test statistic under the null hypothesis; the most common method is to assume that the score values obtained in the comparison of the IP sample against the control will be a good estimation of the desired distribution. Examples of software that implement this approach are PICS , MACS  or QuEST . A second problem with the use of the Poisson test is that the comparison of different ChIP-seq experiments is not straightforward, since the obtained scores or p-values will depend on the statistical power of each particular experiment (e.g. number of replicates, number of reads obtained, etc.). A review of existing algorithms is given in .
Summary of read statistics for the ChIP-seq libraries analysed
No. of sequenced reads
No. of mapped reads
No. of non-duplicated mapped reads
Percentage of duplicated mapped reads
We present here an R package that implements the statistical methodology previously outlined by our group [2, 4]. The method was developed to efficiently handle high-copy numbers of reads that result from PCR artefacts without the need of eliminating duplicated sequences. The coverage distribution of samples is normalized to obtain the same mean and variance across samples. Users can choose between Poisson or ratio-based testing. FDR control is achieved through the well-known method of permutations . The most time-consuming functions are implemented in C++ and are fully integrated in the package. A comparison with three other publically available methods is presented in the context of plant ChIP-seq analysis.
The software accepts any plain text, tabular data format containing the following information for each mapped read: chromosome, location (bp), strand (+/-), read length (bp), and number of times mapped on the genome. Users can define specific input table formats in addition to the default option of the package, which expects the standard AlignedRead format supported by Bioconductor or the output of the mapping program SOAPv2. The average length of the DNA fragments subjected to sequencing must be provided by the user.
- 1)Variable number of mapped reads among sequencing experiments. As commonly handled in the literature, the CSAR package allows normalization of the data by reporting the number of hits per θ millions of reads, where θ is an arbitrary number. Namely, the counts x is are transformed to
- 2)Variable number of regions sequenced. In the IP sample, reads will come preferentially from true positive and false positive protein-bound regions, while in the control sample, reads will come preferentially from false positive regions. This will result in different coverages in the IP and control samples that should be taken into account in the analysis . In contrast to other packages, CSAR will normalize the read-coverage distribution in the IP sample to have the same mean and variance as the control sample. Namely, the observed y i2 are transformed to
where and denote the mean and standard deviation of y is .
After normalization, a score (t i ) is calculated for each nucleotide position i on the basis of the Poisson-based or ratio (default) test.
where n 0 denotes the number of genomic positions for which x i2 > 0. In our experience, the ratio test gives more comparable results among different experiments, which is due to the fact that its score value is less dependent on the statistical power of the experiments as for the Poisson test.
Candidate peaks are defined as genomic regions with score values (t i ) higher than a given cut-off. Candidate peaks separated by less than 100 bp (default parameter value) are merged. The maximum score value of the candidate peak is used as the test statistic value to test its significance.
In contrast to other packages, CSAR subsequently uses a permutation method to obtain the test-statistic threshold corresponding to a desired FDR level. Individual mapped reads are labeled as "control" or "IP" if they belong to either the control or IP sample, respectively. The labels are then randomly permuted between the mapped reads, and the new permutated datasets are subjected to the previously described ChIP-seq analysis. Since this permutation process removes any relationship between the mapped reads and the sample they came from , the score values obtained over a sufficient number of permutations will provide an accurate estimation of the score distribution under the null hypothesis that can be used to control the error rate, for example FDR.
CSAR can generate results regarding genomic positions of significantly read-enriched regions and their distance to annotated genomic features (e.g. genes, other annotated binding events) in tabular format. These can be directly used by other R functions or packages for further analysis or for graphical representation. The read-enriched genomic regions can be written to a UCSC web-browser compatible wiggle (wig) file and visualized (Figure 1D) with, for example, the Integrated Genome Browser . The default parameters in CSAR are optimized for Arabidopsis ChIP-seq data, but they can easily be adjusted for other organisms.
Results and Discussion
CSAR has been successfully used to analyze several plant ChIP-seq experiments and was shown to be computationally efficient and accurate [2, 19]. Table 1 summarizes characteristics of Illumina sequence libraries that were reanalyzed in this study in order to compare the performance of CSAR (v1.4.0) with four other publicly available methods, i.e. QuEST (v2.4), PICS (v1.4.0), MACS (v1.4.0rc2) and Cisgenome (v1.2) [12–15]. SEPALLATA3 (SEP3) and APETALA1 (AP1) are two MADS-domain transcription factors involved in the regulation of floral development in Arabidopsis thaliana. Datasets S1 and Sc represent an experimental IP and control libraries for a SEP3 ChIP-seq experiment . S6, S7 and S8 represent sequencing libraries from the same IP experiment, except that low amounts of DNA were recovered from the ChIP step. Standard Illumina protocol was used for the library preparation. S6, and S7 were prepared according to the standard protocol and PCR amplified in 20 cycles. An additional second PCR amplification step (+10 cyles) was performed to the library S8. The amplification produced high numbers of duplicate reads (Table 1), with library S8 most affected. We used these libraries to evaluate the robustness of our method against PCR artefacts. Datasets S2-S5 represent in silico modifications of the S1 library. At random, 2000 uniquely mapped reads from S1 were amplified one hundred times each and added to the original S1 dataset. This process was repeated four times to generate the four dataset S2-S5. Datasets A1 and Ac represent the IP and control libraries, respectively, combining two biological AP1 ChIP-seq replicates . Libraries A1 and Ac were sequenced on the Genome Analyzer II, the others on the Genome Analyzer I; all libraries were sequenced to a 36 bp read length. Table 1 summarizes the number of mapped reads, as well as the percentage of duplicate reads present in each dataset.
SOAPv2 (default parameters) was used to uniquely map reads to the Arabidopsis genome (ATH1.1con.01222004; ftp://ftp.arabidopsis.org/). Reads mapping to the chloroplast or mitochondrial genomes were discarded. Remaining reads were analyzed with default parameters at an FDR level <0.05 by CSAR, QuEST, PICS, MACS and Cisgenome [12–15] using the appropriate dataset as a control.
For AP1, publically available gene expression data could be used to validate peak calling. The expression data was generated in AP1 induction experiments on the same tissue that was used in our AP1 ChIP-seq experiment . Figure 2B shows the percentage of significant AP1 peaks declared by each method close to at least one potential direct target gene, where the target genes were as the ones which were differentially expressed in the time-series gene expression data . CSAR shows a stronger enrichment than other methods.
Number of significant regions detected
S1 vs Sc
S2-S5 vs Sc*
S6 vs Sc
S7 vs Sc
S8 vs Sc
In the analysis of the in silico modified libraries S2-S5, MACS, CSAR and QuEST are the most robust methods with respect to the presence of high numbers of duplicate reads, as indicated by the low percentages of False Positives, an error rate below the 5% FDR control desired. A possible cause for the high percentage of False Positives obtained by Cisgenome in our in silico modified datasets might be in its FDR estimation step. Cisgenome assumes a Negative Binomial or a Poisson distribution for the score distribution under the null hypothesis. However, the presence of high numbers of duplicate reads will modify its original distribution and will have a strong effect on the FDR estimation.
In the case of the experimental libraries which had high levels of duplicate reads (S6, S7 and S8), CSAR clearly shows a lower percentage of False Positives than all other packages, with an error rate close to the desired 5% FDR control. Because one might argue that this is done at the cost of having a relatively small number of significant regions declared in comparison to other packages, we repeated computations in CSAR with a more relaxed error control that gave 80% of false positives (a rate similar to the one actually obtained for MACS). In this way, 717 true positive (common) regions were found for S6 (out of 3,597 significant regions), 771 for S7 (out of 3,737), and 655 for S8 (out of 3,307), which is comparable with the number of true positives obtained with MACS. It is interesting to note that although MACS shows 0% of False Positives in the in silico libraries, in the experimental libraries, the error increases to an average of 79%. MACS (default options) eliminates reads that map to the exact same positions and strand above a maximum number. For this reason, MACS eliminates the reads added in silico since these have the same sequence and therefore the same position and strand. In the experimental libraries, this strategy apparently did not work out. We hypothesize that due to degradation of the DNA fragments subjected to sequencing or due to sequencing errors, the short reads obtained from fragments with the same sequence will not always have the exact same positions, preventing MACS from eliminate them. In the CSAR approach this is not a problem because it requires both strands to support the binding event independently.
Since the percentage of duplicate reads can be easily calculated, we advise to always report it as a measure of quality in future ChIP-seq experiments. In this study we used the libraries S6 - S8 as extreme examples of the effect of PCR artefacts, but we advise in general against working with high levels of duplication in a normal ChIP-seq experiment. Further study should establish more precisely which levels of duplication are still acceptable. This should be done in combination with evaluating other parameters such as the number of mapped reads. When working with proteins that bind preferentially to promoter regions, we found it useful to graphically represent for each experimental library the distribution of distances (bp) between the position of read-enriched regions and the start position of genes; in such graphs one should typically see enrichment in the expected positions (e.g.: promoter regions for SEP3 and AP1 TFs). If this is not the case, this might be an indication of a problem in the experimental IP enrichment. CSAR provides functions to easily calculate and visualize this distribution and to report the number of duplicate reads.
In conclusion, the CSAR package, implemented in the popular R language, provides an accurate and efficient tool for the analysis of plant ChIP-seq data. It shows better accuracy compared to other methods in the two plant ChIP-seq experiments considered, and, in particular, it shows a high level of robustness against PCR-artefacts. A good error rate control is one of the most important features of any statistical process, and CSAR shows a good control even with a high percentage of duplicate reads.
Availability and requirements
Project name: CSAR
Project home page: http://bioconductor.org/packages/release/bioc/html/CSAR.html
Operating system(s): Platform independent
Programming language: R
Other requirements: R version 2.8.1 or superior
Any restrictions to use by non-academics: None
The software (source code) and examples are attached in Additional file 1. It can also be downloaded via the project home page.
Funding: JMM was supported by grants from the Netherlands Bioinformatics Centre (NBIC), which is part of the Netherlands Genomics Initiative, and from the Netherlands Organization for Scientific Research (NWO; Horizon grant #93519020).
- Johnson D, Mortazavi A, Myers R, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007, 316 (5830): 1497-1502. 10.1126/science.1141319.View ArticlePubMedGoogle Scholar
- Kaufmann K, Muino JM, Jauregui R, Airoldi CA, Smaczniak C, Krajewski P, Angenent GC: Target Genes of the MADS Transcription Factor SEPALLATA3: Integration of Developmental and Hormonal Pathways in the Arabidopsis Flower. Plos Biology. 2009, 7 (4): 854-875.View ArticleGoogle Scholar
- Yant L, Mathieu J, Dinh TT, Ott F, Lanz C, Wollmann H, Chen X, Schmid M: Orchestration of the Floral Transition and Floral Development in Arabidopsis by the Bifunctional Transcription Factor APETALA2. Plant Cell. 2010, 22 (7): 2156-2170. 10.1105/tpc.110.075606.PubMed CentralView ArticlePubMedGoogle Scholar
- Kaufmann K, Muino JM, Osteras M, Farinelli L, Krajewski P, Angenent GC: Chromatin immunoprecipitation (ChIP) of plant transcription factors followed by sequencing (ChIP-SEQ) or hybridization to whole genome arrays (ChIP-CHIP). Nat Protocols. 2010, 5 (3): 457-472. 10.1038/nprot.2009.244.View ArticlePubMedGoogle Scholar
- Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009, 25 (15): 1966-1967. 10.1093/bioinformatics/btp336.View ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nat Meth. 2009, 6 (11s): S22-S32. 10.1038/nmeth.1371.View ArticleGoogle Scholar
- Robinson M, Oshlack A: A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology. 2010, 11 (3): R25-10.1186/gb-2010-11-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotech. 2009, 27 (1): 66-75. 10.1038/nbt.1518.View ArticleGoogle Scholar
- Johannes F, Wardenaar R, Colome-Tatche M, Mousson F, de Graaf P, Mokry M, Guryev V, Timmers HTM, Cuppen E, Jansen RC: Comparing genome-wide chromatin profiles using ChIP-chip or ChIP-seq. Bioinformatics. 2010, 26 (8): 1000-1006. 10.1093/bioinformatics/btq087.View ArticlePubMedGoogle Scholar
- Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotech. 2008, 26 (11): 1293-1300. 10.1038/nbt.1505.View ArticleGoogle Scholar
- Zhang X, Robertson G, Krzywinski M, Ning K, Droit A, Jones S, Gottardo R: PICS: Probabilistic Inference for ChIP-seq. Biometrics. 2010,Google Scholar
- Zhang Y, Liu T, Meyer C, Eeckhoute J, Johnson D, Bernstein B, Nussbaum C, Myers R, Brown M, Li W: Model-based Analysis of ChIP-Seq (MACS). Genome Biology. 2008, 9 (9): R137-10.1186/gb-2008-9-9-r137.PubMed CentralView ArticlePubMedGoogle Scholar
- Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Meth. 2008, 5 (9): 829-834. 10.1038/nmeth.1246.View ArticleGoogle Scholar
- Kozarewa I, Ning ZM, Quail MA, Sanders MJ, Berriman M, Turner DJ: Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G plus C)-biased genomes. Nature Methods. 2009, 6 (4): 291-295. 10.1038/nmeth.1311.PubMed CentralView ArticlePubMedGoogle Scholar
- Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Swerdlow H, Turner DJ: A large genome center's improvements to the Illumina sequencing system. Nat Meth. 2008, 5 (12): 1005-1010. 10.1038/nmeth.1270.View ArticleGoogle Scholar
- Aird D, Ross M, Chen WS, Danielsson M, Fennell T, Russ C, Jaffe D, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biology. 2011, 12 (2): R18-10.1186/gb-2011-12-2-r18.PubMed CentralView ArticlePubMedGoogle Scholar
- Kaufmann K, Wellmer F, Muino JM, Ferrier T, Wuest SE, Kumar V, Serrano-Mislata A, Madueno F, Krajewski P, Meyerowitz EM: Orchestration of Floral Initiation by APETALA1. Science. 2010, 328 (5974): 85-89. 10.1126/science.1185244.View ArticlePubMedGoogle Scholar
- Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J: Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells. Cell. 2008, 133 (6): 1106-1117. 10.1016/j.cell.2008.04.043.View ArticlePubMedGoogle Scholar
- Peng JC, Valouev A, Swigut T, Zhang J, Zhao Y, Sidow A, Wysocka J: Jarid2/Jumonji Coordinates Control of PRC2 Enzymatic Activity and Target Gene Occupancy in Pluripotent Cells. Cell. 2009, 139 (7): 1290-1302. 10.1016/j.cell.2009.12.002.PubMed CentralView ArticlePubMedGoogle Scholar
- Doerge RW, Churchill GA: Permutation Tests for Multiple Loci Affecting a Quantitative Character. Genetics. 1996, 142 (1): 285-294.PubMed CentralPubMedGoogle Scholar
- Nicol JW, Helt GA, Blanchard SG, Raja A, Loraine AE: The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets. Bioinformatics. 2009, 25 (20): 2730-2731. 10.1093/bioinformatics/btp472.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.