Identifying Multi-Class Protein Folds Based on the Random Forest Algorithm

Abstract
With the completion of the Human Genome Project, the "post-genome era" has produced large numbers of protein sequences whose structural information must be annotated by high-throughput computational methods. A protein can perform its physiological functions only if it folds into its proper structure, and abnormal protein folding may cause disease. For example, the pathogenic prion protein (PRNP), produced by abnormal folding, accumulates in the brain and causes spongiform encephalopathies such as mad cow disease, and protein misfolding is also implicated in neurodegenerative diseases such as Alzheimer's disease and Parkinson's disease. The correct identification of protein folds is therefore valuable for studies of pathogenic mechanisms and for drug design, and it is an important research topic in bioinformatics. Since Ding and Dubchak's recognition of 27-class protein folds in 2001, the algorithms, prediction parameters, and datasets used for protein fold prediction have all been improved. Building on this previous research, our major contributions are as follows:
(1) Starting from the 76-class fold dataset built by Liu et al. in our group, we reorganized the dataset in this paper, adding another 8 and 5 protein sequences to the training set and testing set, respectively. Sequence identity within the dataset is below 35%, and each fold type contains at least 10 sequences. The training set and testing set contain 1744 and 1727 protein chains, respectively. The first 27 fold types are consistent with Ding and Dubchak's dataset, and each fold type has been expanded.
(2) Considering the correlation at the level of secondary structure segments, we propose interaction information that reflects the segment order and the long-range correlation of the sequence. This information has a major influence on protein folding and has not been considered by previous researchers. As chemical shifts reflect the