Understanding the factors governing protein solubility is key to grasping the mechanisms of protein solubility and may provide insight into protein aggregation and misfolding-related diseases such as Alzheimer’s disease. Our results provide not only a reliable model for predicting protein solubility but also a list of features important to protein solubility. The predictive model is implemented as a freely available web application at http://shark.abl.ku.edu/ProS/.

An earlier model named PROSO was built on 14000 protein sequences 16. The model achieved an accuracy of 72% in its authors' assessments 16. More recently, the same group of authors reported an improved model, PROSOII, achieving an accuracy of 75.4% using a logistic function and an adapted Parzen window algorithm based on k-mer properties of 82000 proteins 17. Magnan et al. constructed an SVM model, SOLpro, with 74% accuracy based on 17000 protein sequences and the frequencies of monomers, dimers, and trimers of amino acids 18. It should be pointed out that the datasets used to build PROSO, PROSOII, and SOLpro were assembled by combining search results from the Protein Data Bank (PDB) 15, the Swiss-Prot database, and TargetDB 19. The proteins were then classified as soluble or insoluble based on their annotations. While these methods were best practice when a suitable experimental dataset was not available, they may not always be reliable. For example, a soluble protein missing proper annotation can be mistakenly classified as an insoluble one. One study analyzed the solubility of the entire proteome of Escherichia coli, and another developed a comprehensive decision tree model to classify soluble and aggregation-prone proteins based on sequence information 21. This model achieves an accuracy of 72% based on 10-fold cross-validation.
Both studies have revealed that the amino acid composition, molecular weight, and pI of proteins are relevant to protein solubility. However, there has been little systematic investigation of the relative importance of the various types of features used to build reliable models. Thus, the goal of this study is to build a model for predicting protein solubility using a minimal subset of the most useful features, identified with a state-of-the-art feature selection algorithm. Such a study can provide information not only for accurately predicting protein solubility but also for discovering the underlying mechanisms of protein solubility.

Materials and methods

Datasets

All proteins used in the study were downloaded from the eSOL database (http://tp-esol.genes.nig.ac.jp/) 20 in February 2012. Only proteins with available sequences are retained. A protein with solubility < 30% is considered aggregation-prone, and a protein with solubility > 70% is considered soluble 20. There are 2183 proteins, including 988 soluble and 1195 aggregation-prone proteins. We then prepare a series of subsets with sequence identities no higher than 90%, 75%, 50%, and 30% using the CD-HIT program 22. We use the 30% identity set, which includes 1918 proteins (886 soluble and 1032 aggregation-prone), to create the final model.

Features

Each protein is encoded with 1438 features that can be grouped into four classes (Table 1). The first class (I) is physicochemical properties, which are the average values over the amino acids of a given protein. The second class (II) includes the absolute counts of amino acids and the counts normalized by the protein length for a given protein. The third class (III) includes the absolute counts of dipeptides and the counts normalized by the protein length for a given protein. The fourth class (IV) includes the remaining features. All 1438 features are sequence-based features or structural features that are predicted from sequences.
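The labeling thresholds and the count-based feature classes (II and III) described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the feature naming scheme (`count_X`, `freq_X`), and the choice to leave proteins with 30-70% solubility unlabeled are assumptions made for illustration, and the physicochemical (class I) and remaining (class IV) features are not covered.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def label_from_solubility(solubility_pct):
    """Map a solubility percentage to a class label.

    Thresholds follow the text: < 30% is aggregation-prone, > 70% is soluble.
    Returning None for the 30-70% range is an assumption of this sketch.
    """
    if solubility_pct < 30:
        return "aggregation-prone"
    if solubility_pct > 70:
        return "soluble"
    return None


def composition_features(sequence):
    """Absolute and length-normalized counts of residues and dipeptides."""
    seq = sequence.upper()
    n = len(seq)
    residue_counts = Counter(seq)
    # Overlapping dipeptides: positions (0,1), (1,2), ...
    dipeptide_counts = Counter(seq[i:i + 2] for i in range(n - 1))

    features = {}
    for aa in AMINO_ACIDS:
        features[f"count_{aa}"] = residue_counts[aa]
        features[f"freq_{aa}"] = residue_counts[aa] / n if n else 0.0
    for dp in DIPEPTIDES:
        features[f"count_{dp}"] = dipeptide_counts[dp]
        # Normalized by protein length, as stated in the text
        features[f"freq_{dp}"] = dipeptide_counts[dp] / n if n else 0.0
    return features
```

For a toy sequence such as `"MKKA"`, `composition_features` returns 840 entries (2 × 20 residue features plus 2 × 400 dipeptide features), which would account for only part of the 1438 features used in the study.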
Although actual structural information should be useful in predicting protein solubility, most proteins in the eSOL database have no solved structures. In addition, previous studies 20 21 have revealed that sequence-dependent features can be effective in predicting protein solubility.

Table 1 The list of 1438 sequence-dependent features

Random Forest

The Random Forest (RF) algorithm 32 is an ensemble machine learning method that uses many independent decision trees to perform classification or regression. Each of the member trees is built on bootstrap samples from the