Posts Tagged ‘Protein’

How to make a protein soluble?

April 30, 2014

Cloning, expression and purification of difficult to clone, express and purify proteins in E. coli 

I have received some emails about the expression of difficult-to-purify proteins, so I thought of putting together a short list of do’s and don’ts. For pure bioinformatics people, please bear with me for a couple of posts. First of all, it is important to know your protein: gather as much information about it as you can. All those small pieces of information help a lot if kept in mind while designing the strategy for cloning, expression and purification. Also find out the source of the protein: eukaryotic, prokaryotic or any other. Basic parameters such as the size of the protein, pI and amino acid composition play a vital role in designing the strategy. Some tools for finding such information are compiled elsewhere on this blog; look for other sources too. The main theme is to find as much information about the protein as you can.

I am not a big fan of purifying the protein under denaturing conditions. There are lots of questions that are difficult to answer if the protein has to be refolded from denaturing conditions: has the protein folded properly, and is this the native fold rather than just some random refolding? These are difficult to demonstrate experimentally unless you already have an assay in mind. Since I have tried that too, I will end by sharing what I have learned on that front.


Downstream experimental procedures: Before designing the strategy for cloning, expression and purification of the protein, it is wise to decide which downstream experiments you are going to perform, because the strategy depends mainly on them. At times it is possible to purify a very small amount of soluble protein from a very large culture (which is fine if you need only a small amount for downstream experiments), in which case one need not go through all the standardization experiments with trials in different vectors and host cells. However, if a large amount of protein is required (as in crystallization experiments), it is advisable to optimize the whole purification process.

Read as much as you can: There are various resources with suggestions for cloning, expression and purification of proteins in the soluble fraction (e.g. the QIAexpress handbook). But please keep in mind that in wet lab work it is easy to suggest, while it takes a lot of time and energy to perform the experiments the way one wishes to, so try what you think is logical and, more importantly, easily available to you (do-able).

Membrane or membrane-associated protein: Check whether the selected protein is a membrane or membrane-associated protein. This can be done using localization prediction tools, some of which are listed elsewhere on this blog. Also, check whether the protein has a transmembrane domain (TMHMM) or a signal peptide (SignalP) in it. These are hydrophobic regions, which is why membrane proteins are a bit tough to get in soluble form until the transmembrane or signal peptide part is removed. It is logical to remove the initial (normally N-terminal) transmembrane or signal peptide part to get the functional domain or domains in soluble form. (I had a similar problem with a protein I was working on; removing the signal peptide and transmembrane domain solved everything: the protein went into the soluble fraction, purified like a charm and even crystallized.)

Check for functional domains in the protein, if any: This will help in determining the probable function of the protein. It will also point to other proteins with a similar domain and how they behaved during cloning, expression and purification in E. coli. If you can find a protein with a similar domain, try its cloning, expression and purification protocol for your target protein. Also, for some proteins the sequence-based analysis results change with the addition of a tag (the pI may shift, for example), so keep this in mind too.

domain analysis

Optimize the temperature: Try different temperatures for growth and induction. The induction temperature is the more crucial of the two.

  1. Try growing cells at 37 °C and inducing at 37 °C.
  2. Try growing cells at 37 °C and inducing at 25 °C for a long time.
  3. Try growing cells at 37 °C and inducing at 16 °C for a long time.
  4. Try growing cells at 25 °C and inducing at 16 °C for a long time.
  5. Try growing cells at 37 °C, followed by chilling at 16 °C for at least one hour before induction.

Low temperature decreases the rate of protein synthesis, and usually more soluble protein is obtained. Also, if the temperature is reduced before induction, the protein is more likely to end up in the soluble fraction; it somehow diverts the protein from the pathway into inclusion bodies (sorry, I do not know how).

Optimize the IPTG concentration: It is a good idea to test a gradient of IPTG at small scale (using a range from 0.1, 0.2, 0.3 … mM) to find the amount required for the optimal expression level of the protein. Normally, IPTG is needed only at very low levels for optimal expression; using a higher concentration is not only costly but also does not improve the expression level much.
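For the small-scale gradient, the volume of IPTG stock to add to each culture follows from C1 x V1 = C2 x V2. Here is a minimal sketch, assuming a 1 M stock and 10 mL test cultures (both values are my assumptions for illustration, and the function name is hypothetical):

```python
def iptg_stock_volume_ul(final_mm, culture_ml, stock_mm=1000.0):
    """Microlitres of IPTG stock needed to reach a final concentration.

    Uses C1 * V1 = C2 * V2 and ignores the small volume added by the stock.
    stock_mm=1000.0 corresponds to an assumed 1 M stock solution.
    """
    if final_mm <= 0 or culture_ml <= 0 or stock_mm <= 0:
        raise ValueError("concentrations and volume must be positive")
    culture_ul = culture_ml * 1000.0
    return final_mm * culture_ul / stock_mm


# Assumed gradient for 10 mL test cultures
for final_mm in (0.1, 0.2, 0.3, 0.5, 1.0):
    volume = iptg_stock_volume_ul(final_mm, culture_ml=10)
    print(f"{final_mm:.1f} mM -> {volume:.1f} uL of 1 M stock")
```

Adjust the stock concentration and culture volume to whatever you actually use at the bench.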

Use a large tag, but make sure to make an arrangement to remove it once you have the protein: Larger tags such as the intein tag, His-SUMO, GST and MBP (maltose binding protein) are known to increase the solubility of proteins; use them if you have the corresponding vectors easily available.

Change the vector: Using a weaker promoter (e.g. trc instead of T7) and a lower copy number plasmid normally increases the chance of purifying the protein from the soluble fraction. Also, N- and/or C-terminal tags (in various vectors) affect the solubility of the protein, especially for proteins whose folding depends on one of these termini.

Change the host cells: Some E. coli strains are better at handling toxic or membrane proteins than others. I had a very good experience working with the C41 and C43 strains, which I came to know through a paper. There are also pLysS versions of these strains; I did not try them, but you can read up and try. Other strains such as Rosetta might also be worth trying (it depends on the strains you can get your hands on, so beg, borrow or steal ;)). For a new protein I usually test as many changes as I can, one by one, at small scale and then move the best ones to large scale. Also, check whether your protein uses codons that are rarely used in E. coli. You can check ‘rare codon usage’ using the different software tools available.
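If you want a quick look at rare codons before turning to dedicated software, a few lines of Python are enough. This is only a rough sketch: the rare-codon list below is an assumption (codons whose tRNAs are commonly described as supplemented in Rosetta-type strains), and the function names are hypothetical. Replace the list with the codon usage table you trust.

```python
# Assumed rare-codon list for E. coli; adjust it to your own reference table.
RARE_CODONS = {"AGG", "AGA", "ATA", "CTA", "CCC", "GGA", "CGG"}


def find_rare_codons(cds, rare=RARE_CODONS):
    """Return (codon_number, codon) pairs for rare codons in an in-frame CDS."""
    # Normalise case, accept RNA input and drop whitespace or line breaks
    seq = "".join(cds.upper().replace("U", "T").split())
    if len(seq) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    hits = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        if codon in rare:
            hits.append((i // 3 + 1, codon))  # 1-based codon position
    return hits


def rare_codon_fraction(cds, rare=RARE_CODONS):
    """Fraction of codons in the CDS that are on the rare list."""
    n_codons = len("".join(cds.split())) // 3
    return len(find_rare_codons(cds, rare)) / n_codons if n_codons else 0.0


print(find_rare_codons("ATGAGGCTAAAATAA"))
print(rare_codon_fraction("ATGAGGCTAAAATAA"))
```

Clusters of rare codons, especially near the start of the gene, are usually the ones worth worrying about, so the positions matter as much as the overall fraction.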

Change the culture media: After changing and optimizing as many parameters as I could, I was still getting a low level of protein in the soluble fraction in LB medium. I read somewhere that someone had a good yield with Terrific Broth; I tried it and it gave way more protein in the soluble fraction. I was happy to use it thereafter for any protein I had to purify.

Use auto-induction media: It is worthwhile trying auto-induction. The idea is that instead of adding an inducing agent like IPTG, one relies on the cell's own lac regulation, which controls T7-based expression systems. If you grow the cells in media containing glucose and lactose, then as the glucose is depleted the cells start taking up lactose in its place, and the lactose acts as the inducer. This induces the promoter on your expression vector and leads to a much more gradual expression than with IPTG.

To be continued on

Purify the protein under denaturing condition and refold: 

In-silico characterization of proteins

March 27, 2012

BLAST: In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. Different types of BLAST are available according to the query sequence. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST program was designed by Eugene Myers, Stephen Altschul, Warren Gish, David J. Lipman, and Webb Miller at the NIH and was published in the Journal of Molecular Biology in 1990.

CDD search: The Conserved Domain Database (CDD) is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

PFAM: The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature. The identification of domains that occur within proteins can therefore provide insights into their function. There are two components to Pfam: Pfam-A and Pfam-B. Pfam-A entries are high quality, manually curated families. Although these Pfam-A entries cover a large proportion of the sequences in the underlying sequence database, in order to give a more comprehensive coverage of known proteins we also generate a supplement using the ADDA database. These automatically generated entries are called Pfam-B. Although of lower quality, Pfam-B families can be useful for identifying functionally conserved regions when no Pfam-A entries are found. Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam-A entries which are related by similarity of sequence, structure or profile-HMM.

TMHMM: A variety of tools are available to predict the topology of transmembrane proteins. To date no independent evaluation of the performance of these tools has been published. A better understanding of the strengths and weaknesses of the different tools would guide both the biologist and the bioinformatician to make better predictions of membrane protein topology.

SignalP: SignalP 4.0 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.

STRING: STRING is a database of known and predicted protein interactions. The interactions include direct (physical) and indirect (functional) associations; they are derived from four sources: genomic context, high-throughput experiments, coexpression and previous knowledge. STRING quantitatively integrates interaction data from these sources for a large number of organisms, and transfers information between these organisms where applicable. The database currently covers 5,214,234 proteins from 1133 organisms.

PROTPARAM: ProtParam is a tool which allows the computation of various physical and chemical parameters for a given protein stored in Swiss-Prot or TrEMBL or for a user-entered sequence. The computed parameters include the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY).
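Two of these parameters are easy to reproduce by hand, which helps in understanding what the numbers mean. The sketch below computes the amino acid composition and the GRAVY score (the mean Kyte-Doolittle hydropathy over all residues). It is my own illustration with hypothetical function names, not ProtParam's code, so use the server for anything you intend to report.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Percentage of each residue type present in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in sorted(counts.items())}


def gravy(seq):
    """Grand average of hydropathicity: sum of hydropathy values / length."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


example = "MKLVAAGG"  # made-up sequence, for illustration only
print(composition(example))
print(round(gravy(example), 3))
```

A positive GRAVY points to an overall hydrophobic protein and a negative one to a hydrophilic protein, which is a handy first hint when you are worried about solubility.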

PROSITE: Search your query sequence for protein motifs, rapidly compare your query protein sequence against all patterns stored in the PROSITE pattern database and determine what the function of an uncharacterised protein is. This tool requires a protein sequence as input, but DNA/RNA may be translated into a protein sequence using transeq and then queried.
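A PROSITE pattern is essentially a compact regular expression, so a single pattern can be tried on your own sequence in a few lines. The following is a simplified sketch with hypothetical function names: it handles the common syntax elements (x, [..], {..}, repeat counts in parentheses and the < and > anchors) but not every corner case of the PROSITE syntax, and it is not how the PROSITE scanning tools are implemented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression."""
    regex = pattern.strip().rstrip(".")                  # patterns may end with '.'
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = repeat range
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
    return regex


def find_motif(pattern, protein):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(protein.upper())]


# N-glycosylation site pattern; the protein sequence is made up
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "MKNGSAV"))
```

Short patterns like this one match by chance very often, so treat a hit as a hint rather than as proof of function.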

InterPro: InterPro is an integrated database of predictive protein “signatures” used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures.

Predicting Subcellular Localization of Proteins

March 19, 2012

It is interesting to study the subcellular localization of proteins for several reasons. Here is a collection of online software tools that help in predicting the subcellular localization of proteins. Prediction is done with programs trained for this purpose, which greatly helps when selecting a protein to work on. Though there are more, I have listed some commonly used ones.


CELLO: CELLO is a multi-class SVM classification system. CELLO uses 4 types of sequence coding schemes: the amino acid composition, the di-peptide composition, the partitioned amino acid composition and the sequence composition based on the physico-chemical properties of amino acids. We combine votes from these classifiers and use the jury votes to determine the final assignment. Reference: Yu CS, Lin CJ, Hwang JK: Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Science 2004, 13:1402-1406.


PSORTb: Based on a study last performed in 2010, PSORTb v3.0.2 is the most precise bacterial localization prediction tool available. PSORTb v3.0.2 has a number of improvements over PSORTb v2.0.4, and version 2 of PSORTb is still maintained. You can submit one or more Gram-positive or Gram-negative bacterial sequences or archaeal sequences in FASTA format, either pasted into the text box on the server page or uploaded as a file.


TMHMM Server: This server is for prediction of transmembrane helices in proteins. You can submit many proteins at once in one FASTA file. Please limit each submission to at most 4000 proteins. Please tick the ‘One line per protein’ option. Please leave time between each large submission. Reference: S. Moller, M.D.R. Croning, R. Apweiler. Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics, 17(7):646-653, July 2001.
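If you have a whole proteome to submit, the 4000-protein limit means splitting the FASTA file into batches first. A minimal sketch in plain Python follows; the function names and the example file name are hypothetical, and it assumes the file is small enough to read into memory.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_batches(records, max_per_batch=4000):
    """Split the record list into consecutive batches of limited size."""
    return [records[i:i + max_per_batch]
            for i in range(0, len(records), max_per_batch)]


def to_fasta(records):
    """Write records back as FASTA text, one sequence line per record."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


demo = read_fasta(">protA\nMKLV\nAAGG\n>protB\nMSTN\n")
for number, batch in enumerate(split_batches(demo, max_per_batch=1), start=1):
    print(f"batch {number}:")
    print(to_fasta(batch), end="")
```

For a real file you would read it with open("proteome.fasta").read() and write each batch to its own file before uploading.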


SignalP 3.0 Server: SignalP 3.0 server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks and hidden Markov models. Reference: Olof Emanuelsson, Søren Brunak, Gunnar von Heijne, Henrik Nielsen. Locating proteins in the cell using TargetP, SignalP, and related tools. Nature Protocols 2, 953-971 (2007).


LOCtree: LOCtree can predict the subcellular localization and DNA-binding propensity of non-membrane proteins in non-plant and plant eukaryotes as well as prokaryotes. LOCtree classifies eukaryotic animal proteins into one of five subcellular classes, while plant proteins are classified into one of six classes and prokaryotic proteins are classified into one of three classes. The novel feature of using a hierarchical architecture is the ability to make intermediate localization class predictions at much higher accuracy. Another source of improvement is the use of ‘noisy’ training data. ‘Noisy’ predictions from LOCKey (SWISS-PROT keyword based annotations) and LOCHom (annotations using sequence homology) are used to train the hierarchical SVMs.


PredictProtein: PredictProtein integrates feature prediction for secondary structure, solvent accessibility, transmembrane helices, globular regions, coiled-coil regions, structural switch regions, B-values, disorder regions, intra-residue contacts, protein-protein and protein-DNA binding sites, sub-cellular localization, domain boundaries, beta-barrels, cysteine bonds, metal binding sites and disulphide bridges.

Protein Blast against another set of proteins

March 17, 2012

This tool is provided by the NCBI BLAST blastp suite: BLASTP programs search protein databases using a protein query. It BLASTs a query protein against a set of other proteins. I found it useful when you don’t wish to BLAST your query against the whole protein database, but only against a set of proteins given by the user. The tool is located on the NCBI BLAST site.
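The same kind of comparison can be run locally if you have the NCBI BLAST+ command-line tools installed, since blastp accepts a FASTA file of subject sequences in place of a database. The sketch below is my own illustration and not part of the web tool; the file names and function names are hypothetical, and it assumes the blastp executable is on your PATH.

```python
import subprocess


def build_blastp_command(query_fasta, subject_fasta, evalue=1e-5):
    """Build a blastp command comparing a query against a set of proteins."""
    return [
        "blastp",
        "-query", query_fasta,      # FASTA file with your protein of interest
        "-subject", subject_fasta,  # FASTA file with the proteins to search
        "-evalue", str(evalue),     # report hits below this E-value
        "-outfmt", "6",             # tabular output, one hit per line
    ]


def run_blastp(query_fasta, subject_fasta, evalue=1e-5):
    """Run blastp and return the tabular hits as lists of column values."""
    command = build_blastp_command(query_fasta, subject_fasta, evalue)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return [line.split("\t") for line in result.stdout.splitlines() if line]


# Hypothetical file names; this only prints the command that would be run
print(" ".join(build_blastp_command("my_protein.fasta", "protein_set.fasta")))
```

Calling run_blastp with real files returns one row per hit, with the query ID, subject ID and percent identity in the first three columns of the tabular format.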
