Posts Tagged ‘InterPro’

Functional Annotation of Hypothetical proteins

April 30, 2014

Cloning, expression and purification of difficult to clone, express and purify proteins in E. coli 

Experimental work is time-consuming, but it is the most direct approach to the functional annotation of hypothetical proteins. At times, however, it is difficult to decide on an experimental design for a relatively new class of protein. As protein databases grow in size and quality, it is becoming easier to find a suitable experimental design for the probable function of a protein. The following steps can help in choosing the type of experimental analysis to perform and the substrate to use in laboratory tests.

If the protein is predicted to be an enzyme, BLAST results normally point to closely related proteins. The papers describing those matching hits can be consulted for the experimental procedures they used, and they may also indicate the type of function the protein is likely to perform.
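
To make the hit-selection step concrete, here is a minimal sketch that shortlists hits from BLAST tabular output (the standard 12-column `-outfmt 6` layout). The thresholds, the function names and the example rows are illustrative assumptions, not recommended values or real results.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str
    pident: float    # percent identity (column 3)
    evalue: float    # expect value (column 11)
    bitscore: float  # bit score (column 12)


def parse_blast_tabular(text):
    """Parse 12-column BLAST tabular output into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[1], float(f[2]), float(f[10]), float(f[11])))
    return hits


def shortlist(hits, max_evalue=1e-5, min_identity=30.0, top=5):
    """Keep significant, reasonably similar hits, best bit score first.

    The default cut-offs are assumptions chosen for illustration only.
    """
    kept = [h for h in hits if h.evalue <= max_evalue and h.pident >= min_identity]
    return sorted(kept, key=lambda h: h.bitscore, reverse=True)[:top]


# Invented example rows (not real BLAST results).
EXAMPLE = "\n".join([
    "q\tsubjA\t62.5\t200\t75\t0\t1\t200\t5\t204\t1e-80\t250.0",
    "q\tsubjB\t28.0\t150\t108\t2\t10\t159\t3\t150\t2e-06\t48.1",
    "q\tsubjC\t45.0\t180\t99\t1\t1\t180\t1\t179\t3e-40\t140.0",
    "q\tsubjD\t55.0\t60\t27\t0\t1\t60\t1\t60\t0.5\t30.2",
])

for hit in shortlist(parse_blast_tabular(EXAMPLE)):
    print(hit.subject, hit.pident, hit.evalue)
```

The shortlisted subject identifiers are the ones worth looking up in the literature first.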

As domain databases grow, it is possible to analyze a protein domain by domain and see whether it is likely to carry out a particular kind of biochemical reaction. NCBI’s Conserved Domain Database (CDD), Pfam and InterPro (searched with InterProScan) contain a large number of conserved domains, each of which defines a functional class. The presence of a particular domain indicates the possible activity of the protein and therefore helps in choosing the substrate for testing its chemical activity in the laboratory.


Composition-based analysis: various online bioinformatics tools analyze the amino acid composition of a protein and report properties that later help with its functional annotation, e.g. ProtParam, SPAAN, MP3 and many more.
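
As a rough illustration of what composition-based analysis starts from, the sketch below computes the percentage of each residue in a sequence. The function name and the example sequence are made up for illustration; the online tools mentioned above report many more properties than this.

```python
from collections import Counter


def aa_composition(seq):
    """Return the percentage of each residue present in a protein sequence."""
    seq = seq.upper().strip()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in sorted(counts.items())}


# Hypothetical sequence, used only to show the call.
for residue, percent in aa_composition("MKKLLDEAGGS").items():
    print(residue, round(percent, 1))
```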

Homology-based modeling: this is an important step in functional annotation based on the structure of the protein, though it may be difficult for proteins with low sequence identity (<30%) to already known crystal structures. A good homology model can nevertheless be an important step towards determining the function of a protein. Secondary and tertiary structure prediction likewise points to similar functional categories and thereby helps in designing relevant experimental assays. Some of the commonly used homology-based modeling tools are listed here
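
The 30% identity figure can be checked on a pairwise alignment between the query and a candidate template. The sketch below assumes the two sequences are already aligned (same length, `-` for gaps) and uses the number of gap-free columns as the denominator; other definitions of percent identity exist, so treat this as one possible convention.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are ignored in this convention
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def is_difficult_template(aligned_query, aligned_template, threshold=30.0):
    """Flag templates below the identity threshold mentioned in the text."""
    return percent_identity(aligned_query, aligned_template) < threshold


# Toy alignment, invented for illustration.
print(percent_identity("ACD-EF", "ACDGEY"))
```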


Phylogenetic analysis: phylogenetic analysis not only shows the evolutionary divergence of the protein but is also an important step towards assessing its functional conservation. It helps in determining the degree of functional similarity with related homologous proteins, and thus in choosing appropriate experimental assays for functional annotation. Combined with molecular dynamics simulation, it also supports in-silico assessment of the ability of a substrate to bind to the protein. This can cut a large number of candidate substrate molecules down to the top hits, which helps to prioritize the experimental analysis and saves time and resources.
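
To show how a tree groups a query with its closest homologs, here is a small sketch of UPGMA clustering on a pairwise distance matrix. UPGMA is used only because it is short to write; it is an assumption on my part, not the method the post prescribes, and real analyses usually rely on neighbor-joining or likelihood-based methods. The names and distances are invented.

```python
from itertools import combinations


def upgma(names, matrix):
    """Cluster sequences by UPGMA and return the tree as nested tuples.

    `matrix[i][j]` is the distance between names[i] and names[j].
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            # size-weighted average distance to the merged cluster
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


# Invented distances between a query and three hypothetical homologs.
NAMES = ["query", "homologA", "homologB", "outgroup"]
MATRIX = [
    [0.00, 0.10, 0.30, 0.80],
    [0.10, 0.00, 0.32, 0.82],
    [0.30, 0.32, 0.00, 0.78],
    [0.80, 0.82, 0.78, 0.00],
]
print(upgma(NAMES, MATRIX))
```

In this toy case the query pairs with homologA first, which would make the assays published for homologA the natural starting point.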


Working with novel proteins for which almost no relevant data is available anywhere can be difficult; in such cases it may be necessary to wait until more information becomes available.

Let me know if you have more suggestions to add on.


In-silico characterization of proteins

March 27, 2012

BLAST: In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. Different BLAST programs are available depending on the type of query sequence. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST program was designed by Eugene Myers, Stephen Altschul, Warren Gish, David J. Lipman, and Webb Miller at the NIH and was published in the Journal of Molecular Biology in 1990.
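
BLAST itself is a fast heuristic, so the sketch below does not reproduce it. Instead it shows the exact local-alignment score (Smith-Waterman dynamic programming) that heuristics like BLAST approximate, using a simple match/mismatch/linear-gap scheme. The scoring values are assumptions for illustration; real protein searches use substitution matrices such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between two sequences (score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                 # start a new local alignment
                prev[j - 1] + s,   # extend diagonally (match or mismatch)
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```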

CDD search: Conserved Domain Database (CDD) is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).
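
To give a feel for what a position-specific score matrix does, here is a toy sketch that builds a log-odds PSSM from a few aligned motif instances and slides it along a sequence. It assumes a uniform background and a simple pseudocount, which is a simplification and not how CDD or RPS-BLAST derive their matrices; the motif and the sequence are invented.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length, gap-free aligned sequences."""
    n = len(aligned_seqs)
    background = 1.0 / len(ALPHABET)  # uniform background (assumption)
    pssm = []
    for column in zip(*aligned_seqs):
        scores = {}
        for aa in ALPHABET:
            freq = (column.count(aa) + pseudocount) / (n + pseudocount * len(ALPHABET))
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def best_window(seq, pssm):
    """Return (start, score) of the best-scoring window, or None if seq is too short."""
    width = len(pssm)
    best = None
    for start in range(len(seq) - width + 1):
        # unknown residues score 0.0 in this sketch
        score = sum(pssm[k].get(seq[start + k], 0.0) for k in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best


MOTIF_INSTANCES = ["GDSL", "GDSL", "GDSV"]  # hypothetical aligned motif
pssm = build_pssm(MOTIF_INSTANCES)
print(best_window("MKKAGDSLTT", pssm))
```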

PFAM: The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can therefore provide insights into their function. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high quality, manually curated families. Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence database, in order to give a more comprehensive coverage of known proteins we also generate a supplement using the ADDA database. These automatically generated entries are called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

TMHMM: TMHMM predicts transmembrane helices in proteins using a hidden Markov model, and it is one of a variety of tools available to predict the topology of transmembrane proteins. At the time of writing, no independent evaluation of the performance of these tools had been published. A better understanding of the strengths and weaknesses of the different tools would guide both the biologist and the bioinformatician to make better predictions of membrane protein topology.

SignalP: SignalP 4.0 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.

STRING: STRING is a database of known and predicted protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources: genomic context, high-throughput experiments, coexpression and previous knowledge. STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently covers 5,214,234 proteins from 1,133 organisms.
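
One simple way to integrate evidence from several sources is to treat each source score as an independent probability and combine them as 1 minus the product of (1 - score). The sketch below does only that; it is a simplified illustration and is not claimed to be STRING's exact scoring scheme, which applies further corrections. The example scores are invented.

```python
def combine_scores(channel_scores):
    """Combine per-source confidence scores, assuming independent evidence."""
    remaining = 1.0  # probability that no source supports the interaction
    for s in channel_scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
        remaining *= 1.0 - s
    return 1.0 - remaining


# Hypothetical scores for one protein pair from three evidence sources.
print(combine_scores([0.4, 0.3, 0.2]))
```

The combined value is never lower than the strongest single source, which matches the intuition that independent lines of evidence reinforce each other.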

PROTPARAM: ProtParam is a tool which allows the computation of various physical and chemical parameters for a given protein stored in Swiss-Prot or TrEMBL or for a user entered sequence. The computed parameters include the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY).
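
Two of the listed parameters are easy to reproduce by hand. The sketch below computes GRAVY as the mean Kyte-Doolittle hydropathy value and the aliphatic index as X(Ala) + 2.9·X(Val) + 3.9·(X(Ile) + X(Leu)), where X is mole percent. It is a minimal re-implementation for illustration, not ProtParam's own code, and it assumes the sequence contains only the 20 standard residues.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq):
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index from the mole percent of Ala, Val, Ile and Leu."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")

    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A")
            + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


# Hypothetical sequence, used only to show the calls.
example = "MKKLLVVAIDG"
print(round(gravy(example), 3), round(aliphatic_index(example), 2))
```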

PROSITE: Search your query sequence for protein motifs, rapidly compare your query protein sequence against all patterns stored in the PROSITE pattern database and determine what the function of an uncharacterised protein is. This tool requires a protein sequence as input, but DNA/RNA may be translated into a protein sequence using transeq and then queried.
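
PROSITE patterns have a compact syntax that maps closely onto regular expressions, so a pattern scan can be sketched in a few lines. The converter below handles the common elements (`x`, `[..]`, `{..}`, repetition counts and the `<` / `>` anchors) and is an illustrative simplification, not the official scanner; rarer constructs such as `>` inside square brackets are not covered. The example uses the N-glycosylation site pattern N-{P}-[ST]-{P} on an invented sequence.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")
        # Separate the residue specification from an optional (n) or (n,m) count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            rx = "."                         # any residue
        elif core.startswith("{"):
            rx = "[^" + core[1:-1] + "]"     # any residue except those listed
        else:
            rx = core                        # single residue or [..] set
        if low:
            rx += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + rx + ("$" if at_end else ""))
    return "".join(parts)


def find_motif(pattern, seq):
    """Return 0-based start positions of non-overlapping matches."""
    return [m.start() for m in re.finditer(prosite_to_regex(pattern), seq)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "MKNGSAANPTQ"))
```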

InterPro: InterPro is an integrated database of predictive protein “signatures” used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures.
