JR-Assembler

Last modified

2014-09-17

JR-Assembler Frequently Asked Questions (FAQs)

Q1. Is JR-Assembler free software?

Yes, it is free for academic use.

Q2. What kind of machine do I need to run JR-Assembler?

JR-Assembler was developed on 64-bit Linux and has been tested on Linux and Mac OS.

Q3. What kinds of reads can I use for JR-Assembler?

JR-Assembler accepts single-end, overlapping and non-overlapping paired-end, and mate-pair reads.

Q4. Can I use Illumina data along with 454 data?

JR-Assembler was designed mainly for Illumina data. Because JR-Assembler currently handles only reads of the same length, one way to incorporate 454 data is to use the module trimReadKmer (see Instructions) to transform the 454 reads into fixed-length k-mers.
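
To illustrate what "transforming reads into fixed-length k-mers" means, here is a minimal Python sketch. It is not the trimReadKmer implementation, whose exact behavior is described in the Instructions; the function name read_to_kmers and the step parameter are assumptions made only for this example.

```python
def read_to_kmers(sequence, k, step=None):
    """Cut one variable-length read into pieces of exactly k bases.

    Hypothetical helper for illustration only (not trimReadKmer).
    step is the offset between consecutive k-mers: step == k gives
    non-overlapping pieces, step < k gives overlapping ones.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if step is None:
        step = k
    if step <= 0:
        raise ValueError("step must be positive")
    # Stop at the last start position where a full k-mer still fits,
    # so every returned piece has the same length.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


print(read_to_kmers("ACGTACGTAC", 4))          # non-overlapping pieces
print(read_to_kmers("ACGTACGTAC", 4, step=3))  # overlapping pieces
```

A read shorter than k yields an empty list, so such reads are dropped rather than padded.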

Generation of the executable script

Q1. Why can't I use JR_script to generate the executable script?

JR_script first checks whether all the required tools can be accessed before generating an execution script file from the configure file. If any of the tools cannot be found, the execution script will not be generated.
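
Before running JR_script, it can help to confirm that the required tools are reachable from your PATH. The following Python sketch shows one way to do such a check; it does not reproduce JR_script's own logic, and the tool names in the usage example are placeholders, so substitute the tools listed in the Instructions.

```python
import shutil


def find_missing_tools(tool_names):
    """Return the names that cannot be found as executables on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


# Placeholder names for illustration only; replace them with the
# tools that your JR-Assembler installation actually requires.
required = ["example_tool_a", "example_tool_b"]
missing = find_missing_tools(required)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All listed tools were found.")
```

If a tool is reported as missing, adding its installation directory to PATH (or correcting its path in the configure file) is usually the first thing to try.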

Execution problem

Q1. Why does JR-Assembler fail when using reads of various lengths?

The current version of JR-Assembler handles only reads of the same length because the kernel was designed around a fixed-length data structure.
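
To check whether an input file meets this requirement, one can tally the read lengths before starting the assembly. The sketch below assumes a standard FASTQ file with exactly four lines per record; the function name read_length_counts is hypothetical and is not part of JR-Assembler.

```python
import io
from collections import Counter


def read_length_counts(fastq_handle):
    """Count reads per length in a FASTQ stream (4 lines per record)."""
    counts = Counter()
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 == 1:  # the second line of each record is the sequence
            counts[len(line.strip())] += 1
    return counts


# Tiny in-memory example; for a real file use open("reads.fastq").
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n")
counts = read_length_counts(example)
if len(counts) > 1:
    print("Reads have mixed lengths:", dict(counts))
```

More than one key in the result means the reads have mixed lengths and need to be trimmed or converted to a common length first.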

Q2. Why does JR-Assembler produce no assembly or only a small subset of the assembly?

A possible reason is that the input reads contain too many sequencing errors, so only a few overlaps between reads can be found. There are two solutions for such a case: base correction and 3’-end trimming. If the input reads contain a moderate number of sequencing errors, a base correction tool is suggested because it does not shorten the reads. In contrast, if the sequencing error rate is high, trimming reads at the 3’-end is recommended because most errors occur at the 3’-end. Users can inspect the quality scores to determine the number of bases to be trimmed. Alternatively, one can try several values and pick the assembly with the largest N50.
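
The two operations mentioned above, trimming a fixed number of bases from the 3’-end and comparing assemblies by N50, can be sketched in Python as follows. The function names are hypothetical and the code is independent of JR-Assembler; N50 is computed with the common definition (the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total).

```python
def trim_three_prime(sequence, quality, n_bases):
    """Remove n_bases from the 3' end of a read and its quality string."""
    if n_bases < 0:
        raise ValueError("n_bases must not be negative")
    keep = max(len(sequence) - n_bases, 0)
    return sequence[:keep], quality[:keep]


def n50(contig_lengths):
    """Length of the contig at which the running total of lengths,
    taken from longest to shortest, first reaches half the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


print(trim_three_prime("ACGTACGT", "IIIIIIII", 3))
print(n50([10, 8, 6, 4, 2]))
```

Trimming the same number of bases from every read keeps all reads at one common length, which matters because JR-Assembler requires reads of the same length.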

We plan to make JR-Assembler output the proportion of seed usage, defined as the number of seeds used for either seed or contig extension divided by the total number of seeds. A high ratio indicates that most seeds either find overlapping reads for extension or are included in the assembled contigs, which suggests that the sequencing quality of the input reads is good. In contrast, a low ratio indicates that most seeds cannot find overlapping reads for extension, which implies a high sequencing error rate. In this case, trimming reads at the 3’-end is recommended.
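
The ratio described above is a simple fraction. Because this output is only planned, the following Python sketch is purely illustrative: the function name and its inputs are assumptions, and no cutoff between a "high" and a "low" ratio is given here, so none is encoded.

```python
def seed_usage_ratio(seeds_used, total_seeds):
    """Fraction of seeds used for seed or contig extension."""
    if total_seeds <= 0:
        raise ValueError("total_seeds must be positive")
    if not 0 <= seeds_used <= total_seeds:
        raise ValueError("seeds_used must be between 0 and total_seeds")
    return seeds_used / total_seeds


# Hypothetical counts, for illustration only.
print(seed_usage_ratio(750, 1000))
```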

Sequencing experiment design

Q1. Is there a sequencing strategy suggested for using JR-Assembler?

Similar to the strategies recommended by ALLPATHS-LG (7), we recommend using an overlapping paired-end library to generate longer connected reads for contig assembly, and several mate-pair libraries with various insert lengths for long-distance jumping during scaffolding. The average genome coverage should be at least 100X to 150X. For overlapping paired-end reads, we provide a simple formula to calculate the insert size.

Insert size = (Read length (L) − # of bases to be trimmed) × 2 − maximum overlap length (m).

For example, if the read length is L = 150 bp, the maximum overlap length is m = 40 bp, and the number of bases to be trimmed at the 3’-end is 30, then the recommended insert size is (150 − 30) × 2 − 40 = 200 bp.
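
The insert size formula above can be written as a small Python function, which makes it easy to try other read lengths or trimming values. The function and parameter names are chosen for this example only.

```python
def overlapping_pe_insert_size(read_length, trimmed_bases, max_overlap):
    """Insert size = (read length - trimmed bases) * 2 - maximum overlap."""
    if not 0 <= trimmed_bases <= read_length:
        raise ValueError("trimmed_bases must be between 0 and read_length")
    trimmed_length = read_length - trimmed_bases
    if not 0 <= max_overlap <= trimmed_length:
        raise ValueError("max_overlap must be between 0 and the trimmed read length")
    return 2 * trimmed_length - max_overlap


# The worked example: L = 150, 30 bases trimmed, m = 40.
print(overlapping_pe_insert_size(150, 30, 40))  # prints 200
```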

For a small or medium genome (≤ 50 Mb), we suggest using MiSeq to sequence the genome. Because JR-Assembler uses whole reads to assemble contigs and MiSeq produces longer reads than HiSeq2000, the MiSeq reads can jump over repeats more frequently. On the other hand, for a large genome (~1 Gb or larger), we recommend HiSeq2000 because of its much higher data throughput.
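
When planning a run against the coverage recommendation given earlier in this answer, the usual estimate is coverage = number of reads × read length / genome size. The Python sketch below applies that general formula; it is not specific to JR-Assembler, and the numbers in the usage lines are illustrative assumptions rather than recommendations.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Average genome coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target coverage."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


# Illustrative values: a 50 Mb genome, 250 bp reads, 100X target.
print(reads_needed(100, 250, 50_000_000))
print(mean_coverage(1_000_000, 150, 1_500_000))
```

If reads are trimmed at the 3’-end before assembly, use the trimmed read length in these estimates, because trimming reduces the effective coverage.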

Questions: jr-assembler@iis.sinica.edu.tw