De-Novo Transcriptome Assembly and Annotation

Transcriptome sequencing (RNA-seq) is used to measure gene expression, reconstruct transcripts, detect SNPs and identify alternative splicing. There are two assembly approaches: genome-dependent (alignment of reads to a reference genome) and genome-independent (de novo assembly). De novo assembly is the process of constructing a reference sequence for a newly sequenced organism directly from the reads, without an existing reference genome. It is necessary for two reasons. First, reference-based assembly requires a reference genome, so it is not useful for organisms whose reference genome is partial or missing. Second, the genome sequences that are available are often incomplete, fragmented or altered. In addition, a genome assembly, unlike a transcriptome assembly, cannot account for structural alterations of mRNA transcripts such as alternative splicing.
In RNA-seq, mRNA is first extracted and purified from the cells and then reverse transcribed to create a cDNA library. The cDNA is fragmented into various lengths and sequenced using high-throughput sequencing techniques.
The idea behind assembly is simple: cDNA sequence reads are assembled into transcripts using a short-read assembler; here we use ‘SOAPdenovo-Trans’. Short-read assemblers generally use one of two basic algorithms. Overlap graphs: compute the pairwise overlaps between the reads and capture this information in a graph. De Bruijn graphs: break the reads into smaller sequences of DNA, called k-mers, and capture overlaps of length k-1 between these k-mers rather than between the reads. As the number of reads grows, it becomes difficult to determine which reads should be joined into contiguous sequences (contigs). The de Bruijn graph addresses this problem: each node is a k-mer of fixed length, and two nodes are connected by an edge if they overlap by k-1 nucleotides.
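To make the de Bruijn idea concrete, here is a minimal sketch in shell and awk. It is not how SOAPdenovo-Trans is implemented; the two toy reads and k=4 are made-up values for illustration. It collects every distinct k-mer as a node and prints an edge wherever the last k-1 bases of one k-mer equal the first k-1 bases of another.

```shell
# Toy reads (hypothetical); real input would be the sequence lines of a FASTQ file.
reads="ATGGCGT
GGCGTGC"
k=4

# Print the edges of a node-centric de Bruijn graph:
# nodes are distinct k-mers, edges join k-mers that overlap by k-1 bases.
debruijn_edges() {
  printf '%s\n' "$reads" | awk -v k="$k" '
    {
      # collect every k-mer of the read as a node
      for (i = 1; i + k - 1 <= length($0); i++)
        seen[substr($0, i, k)] = 1
    }
    END {
      # connect a -> b when the suffix of a equals the prefix of b
      for (a in seen)
        for (b in seen)
          if (substr(a, 2) == substr(b, 1, k - 1))
            print a " -> " b
    }' | LC_ALL=C sort
}

debruijn_edges
```

Although the two reads are never compared with each other directly, they share the k-mers GGCG and GCGT, so the edges form one path from ATGG to GTGC that spells the merged sequence ATGGCGTGC.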
De novo assembly workflow
Input files
Paired-end RNA-seq reads of P. chabaudi in FASTQ format
File 1:         ERR306016_1.fastq
File 2:         ERR306016_2.fastq
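Before starting, it can be worth confirming that the two files really are a matched pair. The sketch below is an optional sanity check, not part of the original workflow; it assumes plain four-line FASTQ records (no wrapped sequence lines), and the function names are made up for this example.

```shell
# Count reads in a FASTQ file, assuming exactly 4 lines per record.
count_fastq_reads() {
  awk 'END { print int(NR / 4) }' "$1"
}

# Compare the read counts of the two files of a paired-end run.
check_pairs() {
  n1=$(count_fastq_reads "$1")
  n2=$(count_fastq_reads "$2")
  if [ "$n1" -eq "$n2" ]; then
    echo "OK: $n1 read pairs"
  else
    echo "MISMATCH: $n1 reads in $1, $n2 reads in $2"
    return 1
  fi
}

# Run the check only if the input files are present in the current directory.
if [ -f ERR306016_1.fastq ] && [ -f ERR306016_2.fastq ]; then
  check_pairs ERR306016_1.fastq ERR306016_2.fastq
fi
```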
Exercise 1. Quality control check using FastQC
Shell command for running FastQC on the read file:
./fastqc ERR306016_1.fastq
A.    Removal of adaptor sequences using FASTX-Toolkit
Shell command (supply the adaptor sequence after -a and the minimum read length to keep after -l):
./fastx_clipper -a -l -C -i ERR306016_1.fastq -o new_ERR306016_1.fastq
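B.    Run the quality check again on the trimmed files, i.e. new_ERR306016_1.fastq and new_ERR306016_2.fastq
Shell command:
./fastqc new_ERR306016_1.fastq
Repeat Exercise 1 for ERR306016_2.fastq to obtain new_ERR306016_2.fastq, then check it in the same way.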
Exercise 2. De novo assembly of reads and RPKM calculation using SOAPdenovo-Trans
Shell command:
./SOAPdenovo-trans-127mer all -K 23 -R -s sample.conf -o plas_R
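The -s option points to a library configuration file whose contents are not shown in this post. A minimal sketch of what ‘sample.conf’ might look like for one paired-end library is given below. The keys follow the SOAPdenovo configuration format, but the values of max_rd_len and avg_ins are assumptions and must be set to match your own reads and library.

```shell
# Write a minimal SOAPdenovo-Trans library configuration.
# max_rd_len and avg_ins are placeholder values, not taken from the data set.
cat > sample.conf <<'EOF'
# maximal read length (assumed value, set to your read length)
max_rd_len=100
[LIB]
# average insert size of the library (assumed value)
avg_ins=200
# 0 = forward-reverse paired-end reads
reverse_seq=0
# 3 = use the reads for both contig and scaffold assembly
asm_flags=3
# trimmed paired-end FASTQ files, read 1 and read 2
q1=new_ERR306016_1.fastq
q2=new_ERR306016_2.fastq
EOF
```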
Clustering of the resulting contigs (the transcriptome) using TGICL
Shell command:
./tgicl plas_R.contig
The result is a file of clustered contig indexes, a file of singleton indexes (contigs that are not in any cluster) and an asm_1 directory containing the CAP3 assembly result. If no singleton file is produced, all contigs were clustered. The contig and singlet sequence files inside the asm_1 directory are merged using the Linux ‘cat’ command.
Shell command:
cat contig singlets > contig_singlets.txt
Retrieve the indexes from the cluster file using ‘grep’ and ‘sed’
grep -v 'CL' plas_R.contig_clusters > contig_index.txt
Replace each tab with a newline using the ‘sed’ command.
sed 's/\t/\n/g' contig_index.txt > contig_index_new.txt
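The two steps above can be tried on a toy file first. The sketch below assumes a cluster file in which each header line starts with ‘>’ and is followed by a line of tab-separated contig IDs; the file contents and the function name are invented for illustration. It uses ‘tr’ instead of ‘sed’ because the \t and \n escapes are a GNU sed feature and may not work with other sed versions, and it anchors the header pattern so that a contig ID containing ‘CL’ is not dropped by accident.

```shell
# Build a toy cluster file (hypothetical contents).
cluster_demo=$(mktemp)
printf '>CL1\t3\nC11\tC27\tC35\n>CL2\t2\nC40\tC52\n' > "$cluster_demo"

# Drop the header lines, then put each tab-separated ID on its own line.
extract_cluster_ids() {
  grep -v '^>' "$1" | tr '\t' '\n'
}

extract_cluster_ids "$cluster_demo"
```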
Retrieve the contigs that are not in any cluster from the SOAPdenovo-Trans assembled contig file ‘plas_R.contig’
grep -Fxv -f contig_index_new.txt plas_R.contig > final_singletons.txt
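Note that ‘grep’ filters individual lines, so it can only remove lines that match an index exactly; the sequence lines of a clustered contig are not removed with its header. A possible alternative is to filter whole FASTA records with awk. The sketch below is an assumption about the file layout rather than part of the original workflow: it takes the contig ID to be the first word after ‘>’ on each header line, and the function name is made up.

```shell
# $1 = file of contig IDs to exclude (one per line), $2 = FASTA file.
# Prints every FASTA record whose ID is not in the exclusion list.
drop_fasta_records() {
  awk '
    # first file: remember the IDs to exclude
    FILENAME == ARGV[1] { drop[$1] = 1; next }
    # header line: take the first word without ">" as the ID and decide
    /^>/ { id = substr($1, 2); keep = !(id in drop) }
    # print header and sequence lines of records that are kept
    keep
  ' "$1" "$2"
}

# Example call, run only if the files from the previous steps exist.
if [ -f contig_index_new.txt ] && [ -f plas_R.contig ]; then
  drop_fasta_records contig_index_new.txt plas_R.contig > final_singletons.txt
fi
```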
Merge ‘contig_singlets.txt’ and ‘final_singletons.txt’ using the ‘cat’ command
cat contig_singlets.txt final_singletons.txt > final_assembled_transcriptome.txt
Exercise 3. Read mapping using SeqMap
Shell command:
./seqmap 2 ERR306016_1.fastq plas_R_contig.fasta seqmap_output.txt /eland:3 /available_memory:8000
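Exercise 2 asks the assembler to report RPKM (reads per kilobase of transcript per million mapped reads). Once the number of reads mapped to each contig is known, the same quantity can be computed by hand as RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to the contig, L is the contig length in bases and N is the total number of mapped reads. The sketch below applies this formula to a made-up three-column table (contig ID, length, mapped reads); it does not parse SeqMap output, and it takes N to be the sum of the counts in the table.

```shell
# Toy table (hypothetical values): contig ID, length in bases, mapped reads.
counts_demo=$(mktemp)
printf 'c1\t1000\t500\nc2\t2000\t500\n' > "$counts_demo"

# Print contig ID and RPKM for every row of the table.
rpkm_table() {
  awk '
    # store each row and accumulate the total number of mapped reads
    { id[NR] = $1; len[NR] = $2; cnt[NR] = $3; total += $3 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s\t%.2f\n", id[i], cnt[i] * 1e9 / (len[i] * total)
    }' "$1"
}

rpkm_table "$counts_demo"
```

Both contigs receive the same number of reads here, so the contig that is twice as long gets half the RPKM.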
Exercise 4. Annotation using standalone NCBI BLAST and the online tool KOBAS
Shell command:
./makeblastdb -in /home/user/Desktop/NGS_Workshop/jyoti/kobas_data/p.chabaudi.pep.fasta -input_type fasta -title plasmo_db -dbtype prot
./blastx -db /home/user/Desktop/NGS_Workshop/jyoti/kobas_data/p.chabaudi.pep.fasta -query /home/user/Desktop/NGS_Workshop/jyoti/plas_R.contig -out kobas_input.fasta -outfmt 6
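With -outfmt 6, BLAST writes a tab-separated table whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore, so despite its name ‘kobas_input.fasta’ is a table rather than a FASTA file. If you want to inspect the result before annotation, the sketch below keeps only the highest-scoring hit per contig, using the bitscore in column 12. This filtering step is an optional illustration, not something the workflow or KOBAS requires, and the function name and toy table are made up.

```shell
# Toy BLAST tabular output (hypothetical hits) with the 12 default columns.
blast_demo=$(mktemp)
printf 'q1\ts1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t150\n' >  "$blast_demo"
printf 'q1\ts2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t180\n' >> "$blast_demo"
printf 'q2\ts3\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60\n'    >> "$blast_demo"

# Keep the hit with the highest bitscore (column 12) for each query.
best_hits() {
  awk -F '\t' '
    !($1 in best) || ($12 + 0 > best[$1]) { best[$1] = $12 + 0; line[$1] = $0 }
    END { for (q in line) print line[q] }
  ' "$1" | LC_ALL=C sort
}

best_hits "$blast_demo"
```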
Run KOBAS using ‘kobas_input.fasta’
KOBAS is available on the following url:


