ITP-Pred: an interpretable method for predicting therapeutic peptides with fused features low-dimension representation

Briefings in Bioinformatics, advance article, 14 December 2020. DOI: 10.1093/bib/bbaa367

Abstract

The peptide therapeutics market is providing new opportunities for the biotechnology and pharmaceutical industries. Therefore, identifying therapeutic peptides and exploring their properties are important. Although several studies have proposed different machine learning methods to predict peptides as being therapeutic peptides, most do not explain the model's decision factors in detail. In this work, an Interpretable Therapeutic Peptide Prediction (ITP-Pred) model based on efficient feature fusion was developed. First, we proposed three kinds of feature descriptors based on sequence and physicochemical property encodings, namely amino acid composition (AAC), group AAC (GAAC) and coded autocorrelation, and concatenated them to obtain the feature representation of therapeutic peptides. We then fed this representation into a CNN-Bidirectional Long Short-Term Memory (BiLSTM) model to automatically learn to recognize therapeutic peptides. Cross-validation and independent verification experiments indicated that ITP-Pred achieves higher prediction performance on the benchmark datasets than the comparison methods. Finally, we analyzed the model's output from two aspects, sequence order and physicochemical properties, mining important features as guidance for the design of better models that can complement existing methods.

Keywords: therapeutic peptide prediction, CNN-BiLSTM, feature fusion, interpretability analysis

Introduction

With the increase in drug resistance over the past few decades, drug production, cost, and research and development face major challenges [1]; the introduction of new antibiotics to the market has experienced significant delays, which has hampered drug development [2–4]. As a strategy to address this problem, peptide drugs have attracted considerable interest. Compared with traditional small-molecule drugs, peptide drugs are more effective, selective and specific, and their degradation products are amino acids, which reduces the risk of drug–drug interactions. The high approval rate of peptide drugs since the beginning of this century suggests that these drugs are gradually playing a leading role in modern medicine and pharmacy. Thus far, antibacterial peptides, anticancer peptides (ACPs), anti-inflammatory peptides, antiviral peptides, cell-penetrating peptides (CPPs), quorum-sensing peptides (QSPs) and other therapeutic peptides have been widely researched and applied. We chose CPPs and QSPs for detailed introduction and study. QSPs act as signaling molecules [5] that help organisms regulate various physiological activities, such as bioluminescence, virulence factor expression, antibiotic production, biofilm formation, sporulation, swarming motility and heredity [6–8]. Therefore, identifying QSPs is extremely important for further understanding their functional mechanisms. CPPs can carry a variety of cargo, such as peptides [9, 10], proteins, drugs [11, 12], nucleic acids [13, 14] and siRNAs [15], across the cell membrane; once conjugated to a CPP, almost any substance can be transported into the cell [16]. Hence, CPPs have remarkable therapeutic potential [17, 18], meeting key needs of molecular medicine for new diagnostic drugs [19, 20]. Although many peptide-based therapies have reached preclinical application, their specific mechanisms remain unclear [21]. In short, identifying the active ingredients of peptides is of great significance for basic research and drug development.
At present, methods for identifying therapeutic peptides fall into two categories: biological wet-lab experiments and computation-assisted identification. Because wet-lab identification of functional proteins is expensive and time-consuming, researchers have begun to develop efficient computational methods [22]. Computation-aided identification methods can be roughly divided into two categories: similarity-based search methods and machine learning-based methods. Similarity-based search is implemented through the Basic Local Alignment Search Tool (BLAST) [23] and Position-Specific Iterated BLAST (PSI-BLAST) [24, 25]. However, the search performance of this approach drops sharply when dealing with massive datasets. Many machine learning methods can efficiently and precisely predict therapeutic peptides on a large scale by using existing protein function annotations and various protein data (Akbar et al., 2017; Li and Wang, 2016; Manavalan et al., 2017). Examples include the classic support vector machine (SVM) and random forest (RF) methods, ensembles of several single classifiers (RF, K-nearest neighbor, SVM, generalized neural network and probabilistic neural network), and a library of SVM models trained on sequence features [26]. These studies show that machine learning predictors can achieve reliable performance. However, existing challenges limit the efficient prediction of therapeutic peptides. First, with the avalanche of protein sequences produced by high-throughput sequencing technology, a large number of potential therapeutic peptides remain to be explored, yet no computational method can directly detect a given therapeutic peptide. Second, for machine learning-based predictors, feature representation is critical to performance [27], and the challenge is how to flexibly use existing descriptor information to train an effective prediction model [28]. Saha et al. [29] proposed a computational model based on amino acid composition (AAC) and dipeptide composition with an SVM to identify toxic proteins in the pathophysiology of infections. Wei et al. [5] predicted QSPs based on feature representation learning and machine learning algorithms. Yi et al. [30] used deep learning to predict ACPs by integrating binary profile features and k-mer sparse matrices. Zhang et al. [31] proposed a novel feature coding and learning scheme based on RF to identify six kinds of therapeutic peptides. These methods incorporate a variety of feature representation algorithms but suffer from high-dimensional features and only moderate experimental results. Finally, most models lack interpretability analysis, leaving their decision-making factors unclear. Thus, computational prediction of therapeutic peptides still needs further development. We aimed to overcome these challenges by proposing a new deep learning model, Interpretable Therapeutic Peptide Prediction (ITP-Pred), to identify QSPs and CPPs based on sequence descriptors combined with statistical summaries of the physicochemical autocorrelation coefficients (AC) of the sequence. The main work and contributions are as follows.
First, ITP-Pred was combined with the latest databases to collect two sets of therapeutic peptide data for cross-validation and independent verification, and the results were compared with classic methods and the latest method for identifying therapeutic peptides. The proposed model performed promisingly on all evaluation indicators, demonstrating its effectiveness. Second, based on sequence and physicochemical property encodings of the peptides, three feature subsets of pairwise combinations were constructed; the experimental comparison shows that ITP-Pred's full feature representation contributes the most to the model. Finally, we used the interpretability of deep learning to analyze the importance and physicochemical properties of the features, providing an important supplement to related research on extracting sequence features.

Materials and methods

The entire workflow of the ITP-Pred method is shown in Figure 1. The main stages in the development of ITP-Pred are described in the following subsections.

Figure 1. Flow chart of the ITP-Pred method. On the left, peptide sequences are extracted from the curated databases, and two descriptors and one coded numerical feature are used to represent them; on the right, the features are input into the deep learning model CNN-BiLSTM, and a final dense layer predicts the probability that a peptide is therapeutic; below is the performance evaluation and the model's function.

Dataset

QSP400 dataset

Many studies have shown that strict datasets are critical to the quality of model predictions. The QSP400 dataset used in this article was collected from the literature [32]. The positive samples are nonredundant peptides processed by the CD-HIT [33] program. The negative samples were obtained in two ways: extracting a negative dataset from UniProt [34] and shuffling QSP sequences. In the end, 440 QSP sequences were obtained, of which 400 (half positive, half negative) were used for training and testing; the remaining samples were used for independent verification.

CPP740 dataset

The CPP740 dataset was collected from the literature [35]. Any two sequences in this dataset share at most 80% sequence identity, the lowest level of identity among existing datasets. It contains 370 high-uptake CPPs as positive samples and an equal number of low-uptake CPPs as negative samples; these 740 peptides were used for model training and prediction. In addition, 92 positive and 92 negative CPP samples distinct from the training set were used for independent verification. To quantify the difference between the training set and the independent verification set, we studied the data distribution from two aspects.
First, the counts of the 20 amino acid types are shown in Figure 2 and Tables 1 and 2; second, the distribution of peptide lengths is visualized in Figure 3.

Figure 2. The distribution of amino acid differences between the training set and the independent test set. (A) Distribution map of CPP740. (B) Distribution map of QSP400.

Table 1. Counts of the 20 amino acids in the CPP datasets

Type         A     C    D    E    F    G    H    I    K     L
Train        1032  532  530  627  572  874  595  640  1290  1246
Independent  256   129  117  129  148  198  202  168  352   327

Type         M    N    P    Q    R     S    T    V    W    Y
Train        515  582  736  603  1658  712  639  596  548  535
Independent  115  152  172  158  444   188  155  129  176  131

Table 2. Counts of the 20 amino acids in the QSP datasets

Type         A    C    D    E    F    G    H    I    K    L
Train        683  533  288  304  551  788  139  524  728  799
Independent  87   61   20   35   50   96   17   69   80   73

Type         M    N    P    Q    R    S    T    V    W    Y
Train        172  451  332  268  416  619  414  454  162  225
Independent  14   53   33   24   33   77   43   62   16   27

Figure 3. The difference between the training set and the independent verification set, quantified by peptide sequence length.
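For readers who wish to reproduce this kind of dataset audit, the following is a minimal Python sketch that tallies the 20 amino acid counts (as in Tables 1 and 2) and collects peptide lengths (as in Figure 3). The variables `train_seqs` and `indep_seqs` are hypothetical placeholders for lists of peptide strings, not part of the original pipeline.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_counts(sequences):
    """Tally the 20 standard amino acids over a list of peptide sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    return {aa: counts[aa] for aa in AMINO_ACIDS}

def length_distribution(sequences):
    """Peptide lengths, e.g. for a histogram like Figure 3."""
    return [len(seq) for seq in sequences]

# Hypothetical usage with training and independent-set sequences:
# print(aa_counts(train_seqs))
# print(length_distribution(indep_seqs))
```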
Features for peptide sequences

Amino acid composition

The primary structure of a protein is composed of 20 kinds of amino acids [36]. The AAC encoding calculates the frequency of each type of amino acid in a protein or peptide sequence, as shown in formulas (1) and (2):

$$P_{AAC}=\{A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y\},\qquad(1)$$

$$f(t)=\frac{N(t)}{N},\qquad(2)$$

where $N(t)$ is the number of amino acids of type $t$ and $N$ is the length of the protein or peptide sequence. We extracted the 20 AAC features using the default parameters of iFeature [37].

Group AAC

In group AAC (GAAC) encoding, the 20 amino acids are divided into five categories according to their physicochemical properties [38]: the aliphatic group (g1: GAVLMI), the aromatic group (g2: FYW), the positively charged group (g3: KRH), the negatively charged group (g4: DE) and the uncharged group (g5: STCPNQ). The GAAC descriptor is the frequency of each amino acid group, defined as

$$f(g)=\frac{N(g)}{N},\quad g\in\{g1,g2,g3,g4,g5\},\qquad(3)$$

$$N(g)=\sum_{t\in g}N(t),\qquad(4)$$

where $N(g)$ is the number of amino acids in group $g$, $N(t)$ is the number of amino acids of type $t$ and $N$ is the length of the peptide sequence. We extracted the five GAAC features using the default parameters of iFeature [37].

Autocorrelation

The physicochemical properties of amino acids are the most intuitive characteristics of biochemical reactions, and they have been widely used in bioinformatics research. We used Pse-in-One 2.0 [28] to extract 21 kinds of protein physicochemical properties via autocorrelation, with the lag parameter set to 2. Studying the physicochemical properties of amino acids alone achieves only moderate results, so the order information of the sequence should also be considered. On the basis of the autocorrelation encoding, the original numerical sequence and its first- and second-order adjacent difference sequences were computed, and statistical characteristics were then calculated to represent the peptide. The four statistics (mean, variance, skewness and kurtosis) are defined as

$$\bar{P}=\frac{1}{N}\sum_{i=1}^{N}P_i,\qquad(5)$$

$$\sigma^2=\frac{1}{N}\sum_{i=1}^{N}\left(P_i-\bar{P}\right)^2,\qquad(6)$$

$$P_s=\frac{\frac{1}{N}\sum_{i=1}^{N}\left(P_i-\bar{P}\right)^3}{\sigma^3},\qquad(7)$$

$$P_k=\frac{\frac{1}{N}\sum_{i=1}^{N}\left(P_i-\bar{P}\right)^4}{\sigma^4}.\qquad(8)$$

Applying the four statistics to the three encoded sequences yields 12 features, denoted AC_P. Finally, the outputs of the three feature representation methods were concatenated, so that each peptide sequence is represented as a $1\times 37$ joint feature vector. Taking the QSP400 data as an example, a $400\times 37$ 2D matrix is obtained; this feature matrix is reshaped into a $400\times 37\times 1$ 3D tensor to serve as input to the model proposed in this paper. For the long short-term memory (LSTM) baseline, the same reshaping is required before the data are input into the model, but for the support vector machine, this step is omitted.
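To make the 1×37 joint feature concrete, the following is a minimal Python sketch of formulas (1)-(8), under two stated assumptions: a single Kyte-Doolittle hydropathy scale stands in for the 21 AAindex properties that the paper extracts with Pse-in-One 2.0, and the small epsilon guard against zero-variance sequences is our addition.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Stand-in property scale (Kyte-Doolittle hydropathy); the paper
# actually averages statistics over 21 AAindex properties.
HYDROPATHY = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4,
              "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5,
              "P": -1.6, "Q": -3.5, "R": -4.5, "S": -0.8, "T": -0.7,
              "V": 4.2, "W": -0.9, "Y": -1.3}

GROUPS = {"g1": "GAVLMI", "g2": "FYW", "g3": "KRH", "g4": "DE", "g5": "STCPNQ"}

def aac(seq):
    """20-dim amino acid composition, formula (2)."""
    n = len(seq)
    return np.array([seq.count(aa) / n for aa in AMINO_ACIDS])

def gaac(seq):
    """5-dim grouped amino acid composition, formulas (3)-(4)."""
    n = len(seq)
    return np.array([sum(seq.count(aa) for aa in g) / n for g in GROUPS.values()])

def four_stats(x):
    """Mean, variance, skewness and kurtosis, formulas (5)-(8)."""
    mu, var = x.mean(), x.var()
    sd = np.sqrt(var) + 1e-8              # guard against zero variance
    skew = np.mean((x - mu) ** 3) / sd ** 3
    kurt = np.mean((x - mu) ** 4) / sd ** 4
    return [mu, var, skew, kurt]

def ac_p(seq):
    """12-dim AC_P: four statistics over the encoded sequence and its
    first- and second-order adjacent difference sequences."""
    p = np.array([HYDROPATHY[aa] for aa in seq])
    feats = []
    for s in (p, np.diff(p, n=1), np.diff(p, n=2)):
        feats.extend(four_stats(s))
    return np.array(feats)

def itp_features(seq):
    """1 x 37 joint feature: AAC (20) + GAAC (5) + AC_P (12)."""
    return np.concatenate([aac(seq), gaac(seq), ac_p(seq)])
```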
Method

The model proposed in this paper mainly comprises a convolution stage and a bidirectional LSTM stage.

Figure 4. The influence of parameters on the ACC/loss of the models. (A) Fixing $lr=0.0035$ for CPP and $lr=0.0025$ for QSP, the influence of the epoch parameter is evaluated. (B) Fixing epoch = 32, the influence of the $lr$ parameter is evaluated.

CNN

Convolutional neural networks (CNNs) can effectively mine latent semantic information in text by convolving its word vectors with multiple convolution kernels [39]. Studies have shown that CNNs can also effectively extract protein sequence information [40]. Specifically, a CNN can use multiple different types of convolution kernels to extract multiple local features. The feature matrix is reshaped into a $400\times 37\times 1$ 3D tensor and used as the input of the CNN. The CNN in our framework includes convolutional, activation and normalization layers. In the convolutional layer, we used four different types of convolution kernels to slide over the feature matrix. The output of the convolutional layer is the extracted features, which are processed by the batch normalization and activation layers; a dropout layer is also added to prevent over-fitting. For example, for the sequence $P=(p_1,p_2,p_3,\dots,p_n)$, each element is vectorized as $p_i=V(w_i)$, converting it into a d-dimensional vector so that the sequence forms a vector matrix. The feature extraction formula is

$$c_i=f_1\left(f_2\left(F_k\times V\left(w_{i:i+k-1}\right)+b\right)\right),\qquad(9)$$

where $c_i$ is the local feature value obtained from one convolution computation, $f_2$ is the batch normalization layer, $f_1$ is a ReLU function, $F_k$ is a filter of dimension $k\times d$ and $b$ is the bias.

BiLSTM

The LSTM neural network is a special type of recurrent neural network (RNN) [41, 42]; in a language model, it can predict the probability of the next word from a bounded window of context, which effectively alleviates the vanishing and exploding gradients of plain RNNs. The core of LSTM is the cooperation of multiple gates, comprising an input gate, a forget gate, an output gate and a memory cell, which together encode the input information at each time step. The behavior of each memory cell is controlled by gates that decide whether information is kept (1) or discarded (0). Specifically, the forget gate $f$ controls whether the current state information is kept, the input gate $i$ controls whether the input information is stored, and the output gate $o$ controls whether the new cell information is output. At time step $t$, after the data enter the unit, LSTM can choose to remember or forget some information, thereby controlling the output.
This state information is passed to the next time step $t+1$, and the computation is as follows:

$$\begin{aligned}
i_t&=\sigma\left(W_{ix}x_t+W_{ih}h_{t-1}\right)\\
f_t&=\sigma\left(W_{fx}x_t+W_{fh}h_{t-1}\right)\\
o_t&=\sigma\left(W_{ox}x_t+W_{oh}h_{t-1}\right)\\
c_t&=f_t\odot c_{t-1}+i_t\odot\tanh\left(W_{cx}x_t+W_{ch}h_{t-1}\right)\\
h_t&=c_t\odot o_t
\end{aligned}\qquad(10)$$

In the formula, $x$ is the input vector, $h$ is the output vector, $\odot$ is the element-wise product, $W$ denotes the weight matrices and $\sigma$ is the sigmoid function. However, an ordinary LSTM can only process a sequence in one direction [43]. The bidirectional RNN model was proposed in 1997 to overcome the inability of unidirectional RNNs to use subsequent context, and BiLSTM is the corresponding extension of LSTM [42]: two unidirectional LSTMs, one running forward and one running backward over each training sequence, are connected to the same layer, so feature information is extracted from both directions. In this experiment, the number of hidden-layer neurons in the BiLSTM is 32; other parameters are set to their default values. CNN and BiLSTM were fused into a single predictor of therapeutic peptides to make better use of both the local and the global information of the peptide sequence; the fused model is shown on the right of Figure 1. Our system is Windows 10 with a tenth-generation Intel processor; the CNN-BiLSTM model typically trains in about 80 seconds under the Keras framework using only the Central Processing Unit (CPU).
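The following is a minimal Keras sketch of the CNN-BiLSTM architecture described above. The paper specifies four types of convolution kernels, a BiLSTM with 32 hidden units, batch normalization, dropout of 0.2 and the Adam optimizer; the specific kernel sizes and filter counts below are our assumptions, not values reported in the text.

```python
from tensorflow.keras import layers, models, optimizers

def build_itp_pred(feature_dim=37, lr=0.0035):
    """Sketch of the CNN-BiLSTM predictor (kernel sizes/filters assumed)."""
    inputs = layers.Input(shape=(feature_dim, 1))
    # Convolution stage, formula (9): four kernel types slide over the
    # feature matrix, each followed by batch normalization (f2) and ReLU (f1).
    branches = []
    for k in (2, 3, 4, 5):                       # assumed kernel sizes
        x = layers.Conv1D(16, k, padding="same")(inputs)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        branches.append(x)
    x = layers.Concatenate()(branches)
    x = layers.Dropout(0.2)(x)
    # BiLSTM stage: forward and backward LSTMs over the local features.
    x = layers.Bidirectional(layers.LSTM(32))(x)
    x = layers.Dropout(0.2)(x)
    # Dense layer outputs the therapeutic-peptide probability.
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Settings from the Parameter setting section (lr = 0.0025 for QSP400):
# model = build_itp_pred(lr=0.0035)
# model.fit(X_train, y_train, batch_size=32, epochs=32)
```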
Results

Evaluation metrics

In this study, we propose a new high-efficiency feature representation with the deep learning CNN-BiLSTM model to predict therapeutic peptides. We used 5-fold cross-validation to evaluate the performance of ITP-Pred and the comparison models: the data are randomly divided into five equal parts, four folds are used as training data and the remaining fold as test data, and the average of the five results is taken as the final evaluation value. The experiments use accuracy (Acc), sensitivity (Sens), specificity (Spec), precision (Prec) and the Matthews correlation coefficient (MCC) as evaluation metrics:

$$\mathrm{Acc}=\frac{\mathrm{TN}+\mathrm{TP}}{\mathrm{TN}+\mathrm{TP}+\mathrm{FN}+\mathrm{FP}},\qquad(11)$$

$$\mathrm{Sens}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},\qquad(12)$$

$$\mathrm{Spec}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}},\qquad(13)$$

$$\mathrm{Prec}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},\qquad(14)$$

$$\mathrm{MCC}=\frac{\mathrm{TP}\times\mathrm{TN}-\mathrm{FP}\times\mathrm{FN}}{\sqrt{\left(\mathrm{TP}+\mathrm{FP}\right)\left(\mathrm{TP}+\mathrm{FN}\right)\left(\mathrm{TN}+\mathrm{FP}\right)\left(\mathrm{TN}+\mathrm{FN}\right)}},\qquad(15)$$

where TN is the number of true negatives, TP the number of true positives, FN the number of false negatives and FP the number of false positives. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) were also adopted to evaluate performance.

Parameter setting

One of the most difficult aspects of training deep models is choosing the hyperparameters. The first is the batch size, the number of samples used per training step: if it is too small the network cannot converge, and if it is too large it may exhaust memory. We searched the range [16, 64], and the overall effect was best at 32; correspondingly, 32 epochs proved most appropriate. Batch normalization, an important concept in deep learning, makes the hyperparameter search easier and stabilizes the network's behavior; the default parameters worked well here. Dropout temporarily removes neural network units with a certain probability during training; over the range [0.1, 0.5], a dropout rate of 0.2 was the most suitable. These settings are shared by the models trained on the QSP400 and CPP740 datasets. Finally, the Adam optimization algorithm was used, and the learning rate ($lr$) was tuned over the range [0.0005, 0.007] with a step size of 0.0001, with other parameters left at their defaults; the optimal $lr$ was 0.0035 for CPP740 and 0.0025 for QSP400. Thus, we set the batch size to 32, epochs to 32 and dropout to 0.2, added a batch normalization layer, and set $lr$ to 0.0035 for CPP740 and 0.0025 for QSP400. The tuning process for epoch and $lr$ is shown in Figure 4.
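As a sanity check on these definitions, the sketch below computes the five metrics, formulas (11)-(15), from predicted labels and wires them into a 5-fold loop; `build_itp_pred`, `X` and `y` refer to the hypothetical names used in the earlier sketches.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Acc, Sens, Spec, Prec and MCC from binary labels, formulas (11)-(15)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prec = tp / (tp + fp)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, sens, spec, prec, mcc

# 5-fold cross-validation loop (sketch):
# from sklearn.model_selection import StratifiedKFold
# for train_idx, test_idx in StratifiedKFold(5, shuffle=True).split(X, y):
#     model = build_itp_pred()
#     model.fit(X[train_idx], y[train_idx], batch_size=32, epochs=32, verbose=0)
#     y_hat = (model.predict(X[test_idx]).ravel() > 0.5).astype(int)
#     print(binary_metrics(y[test_idx], y_hat))
```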
Overall performances of cross-validation

We trained our model ITP-Pred on the benchmark CPP740 and QSP400 datasets to evaluate its ability to predict therapeutic peptides; Tables 3 and 4 report the 5-fold cross-validation results. On the CPP740 dataset (Table 3), the averages are 89.0% Acc, 86.3% Sens, 93.2% Spec, 84.9% Prec and 78.7% MCC; ITP-Pred shows excellent therapeutic peptide identification ability on this dataset. On the QSP400 dataset (Table 4), the averages are 87.0% Acc, 85.5% Sens, 88.0% Spec, 84.5% Prec and 72.9% MCC. In general, the performance of a deep learning model scales with the amount of data; although QSP400 is not very large, the model still performs well under 5-fold cross-validation. According to the experimental results on the benchmark datasets QSP400 and CPP740, our model has an outstanding ability to predict therapeutic peptides.

Table 3. Five-fold cross-validation details on the CPP740 dataset

Fold set   Acc (%)  Sens (%)  Spec (%)  Prec (%)  MCC (%)
1          91.2     93.0      89.2      93.2      82.5
2          87.8     81.8      97.3      78.4      77.1
3          90.1     88.5      93.2      87.8      81.2
4          86.5     84.6      89.2      83.8      73.1
5          89.2     83.7      97.3      81.1      79.4
Average    89.0     86.3      93.2      84.9      78.7

Table 4. Five-fold cross-validation details on the QSP400 dataset

Fold set   Acc (%)  Sens (%)  Spec (%)  Prec (%)  MCC (%)
1          87.4     84.6      82.5      85.0      67.5
2          85.0     79.2      95.0      75.0      71.4
3          93.8     97.3      90.0      97.5      87.8
4          88.8     87.8      90.0      87.5      77.5
5          80.0     78.6      82.5      77.5      60.1
Average    87.0     85.5      88.0      84.5      72.9

Comparison and verification

Comparative experiment on the benchmark datasets

We selected some classic machine learning algorithms and the latest method to compare with ITP-Pred on the benchmark CPP740 and QSP400 datasets: SVM, an LSTM network and the PPTPP method proposed by Zhang and Zou [31], under the same evaluation criteria. Figure 5 and Table 5 show the details of the comparison. On CPP740, ITP-Pred significantly outperformed the other methods, with an Acc of 89.0%, Sens of 86.3%, Spec of 93.2%, Prec of 84.9%, MCC of 78.7% and AUC of 96.2%; its AUC was 13.8% higher than that of the latest method PPTPP [31] and 8.6% higher than that of LSTM. On QSP400, ITP-Pred also performed remarkably, with an Acc of 87.0%, Sens of 85.5%, Spec of 88.0%, Prec of 84.5%, MCC of 72.9% and AUC of 91.8%; its Acc, Sens, Spec, MCC and AUC were all higher than those of PPTPP, with AUC and Acc increased by 0.3 and 3.5%, respectively. The experimental results show that our model is suitable for identifying and predicting therapeutic peptides, that ITP-Pred is a competitive model that can accelerate related research, and that the comparative results support our hypothesis.

Figure 5. Comparison of ITP-Pred with SVM, LSTM and PPTPP on the benchmark datasets QSP400 (A) and CPP740 (B).

Table 5. Comparative experiment on the benchmark datasets

Dataset  Model     Acc (%)  Sens (%)  Spec (%)  Prec (%)  MCC (%)  AUC (%)
CPP740   SVM       76.6     71.7      88.4      64.9      64.9     76.6
         LSTM      80.0     83.2      77.6      82.4      61.7     87.6
         PPTPP     74.9     71.6      78.1      76.0      49.8     82.4
         ITP-Pred  89.0     86.3      93.2      84.9      78.7     96.2
QSP400   SVM       66.0     63.3      76.0      56.0      32.1     66.0
         LSTM      56.5     67.8      27.0      86.0      16.8     65.3
         PPTPP     83.5     85.5      85.0      85.1      70.5     91.5
         ITP-Pred  87.0     85.5      88.0      84.5      72.9     91.8

Note: The best performance in each dataset is given in boldface.
Independent experiments

Here, we used independent verification sets, disjoint from the training and test sets, to evaluate ITP-Pred's performance. The 184 CPPs and 40 QSPs in the independent verification sets are equally divided into positive and negative samples; they were input into the trained models for independent verification and compared with the SVM, LSTM and PPTPP [31] models. Table 6 shows the comparison results of the four methods. On the independent verification set of QSP400, AUC and Acc reached 99.8 and 97.5%, respectively; on that of CPP740, AUC and Acc reached 98.9 and 95.1%, respectively. Compared with the latest method PPTPP [31], AUC improved by 5.8% on QSP400 and by 2.4% on CPP740. The results of the independent verification experiments confirm that our model has reliable predictive performance.

Table 6. Comparison of the four methods on the independent verification sets

Dataset  Model     Acc (%)  Sens (%)  Spec (%)  Prec (%)  MCC (%)  AUC (%)
CPP740   SVM       70.7     62.7      80.4      60.9      42.1     70.7
         LSTM      85.3     89.2      80.4      90.2      71.0     96.3
         PPTPP     –        –         –         –         –        96.5
         ITP-Pred  95.1     92.8      97.8      92.4      90.4     98.9
QSP400   SVM       65.0     62.5      75.0      55.0      30.6     65.0
         LSTM      52.5     1.0       5.0       1.0       16.0     66.8
         PPTPP     –        –         –         –         –        94.0
         ITP-Pred  97.5     95.2      1.0       95.0      95.1     99.8

Note: The best performance in each dataset is given in boldface.
Analysis of the influence of different feature subsets on experimental performance

To explore the impact of different feature subsets on the overall classification performance, we constructed three feature subsets from pairwise combinations: the 25-dimensional AAC plus GAAC, the 32-dimensional AAC plus the physicochemical AC_P, and the 17-dimensional GAAC plus the physicochemical AC_P, and compared them with the full ITP-Pred feature set. Figure 6A and B show the ROC curves on the CPP740 and QSP400 datasets, respectively. On both datasets, the combination of the three features proposed in this paper (AAC, GAAC and AC_P) contributes the most to the model, with AUCs 1.3 and 4.1% higher than AAC plus GAAC, respectively, suggesting that an effective combination of multiple features can improve model performance.

Figure 6. Feature combinations on the datasets CPP740 (A) and QSP400 (B). Our proposed method combines three features: AAC, GAAC and the encoded physicochemical AC_P.

Analysis of the importance and physicochemical properties of AC_P

In the feature description above, we introduced the physicochemical AC_P as one of ITP-Pred's features: the encoded sequence itself was supplemented with its first- and second-order adjacent difference sequences, and four statistics were calculated for each of these three encoded sequences to represent the peptide. To illustrate the nature and rationality of the encoded features, we conducted an interpretability analysis of AC_P. Each point in Figure 7A and C represents a sample of the CPP740 and QSP400 datasets, respectively; a good feature should spread the samples out. Figure 7B and D show the features' average impact on the model output magnitude for CPP740 and QSP400. For both datasets, the top five important features are the same: Feature 6, Feature 1, Feature 3, Feature 7 and Feature 10, which cover the important statistics of the three sequences encoded by AC_P. Feature 6, the most important, is the variance of the first-order adjacent difference sequence. Feature 1 is the mean of the original sequence. Feature 3 is the variance of the original sequence.
Feature 7 is the kurtosis of the first-order adjacent difference sequence, and Feature 10 is the variance of the second-order adjacent difference sequence. The experimental results show that, compared with the other features, the variance and kurtosis of the first-order difference sequence, the mean and variance of the original sequence, and the variance of the second-order difference sequence are effective features for encoding a sequence. This finding illustrates the statistical significance of the sequence encoding and supplements existing work on sequence feature extraction.

Figure 7. Feature interpretability on the datasets CPP740 and QSP400. (A) Impact on the model for CPP740. (B) Feature importance map for CPP740. (C) Impact on the model for QSP400. (D) Feature importance map for QSP400.

To explain which physicochemical properties of AC_P play a decisive role, and thus help us judge the nature of a peptide intuitively, we conducted an interpretability analysis of the physicochemical properties used by AC_P; the results are shown in Table 7. For QSPs, the properties with the top three SHapley Additive exPlanation (SHAP) values are PALJ810106, KRIW710101 and NADH010105; for CPPs, they are GRAR740101, RACS820104 and NADH010105.

Table 7. Dominant physicochemical properties of the peptides

Peptide  AA index    SHAP value  Description
QSP      PALJ810106  0.35        Normalized frequency of turn from CF
QSP      KRIW710101  0.34        Side chain interaction parameter
QSP      NADH010105  0.33        Effective partition energy
CPP      GRAR740101  0.25        Composition
CPP      RACS820104  0.22        Average relative fractional occurrence in EL(i)
CPP      NADH010105  0.21        Effective partition energy

PALJ810106 represents the normalized frequency of turn from CF.
The normalized propensity parameter is defined as the ratio of the frequency of an amino acid in a secondary structure to its frequency in the entire sample, which directly promoted the development of secondary structure prediction methods that need only the amino acid sequence [44]. GRAR740101 denotes the frequency of occurrence in protein sequences, one of six 'natural' features of amino acids; Charoenkwan et al. [45] report that the physicochemical property GRAR740101 has a Gini index value greater than 10. KRIW710101 represents a side chain interaction parameter. A lysine salt bridge within an α-helical peptide is currently the strongest α-helix side-chain interaction measured [46], and studies have shown that peptides can be stabilized by salt bridge interactions with appropriately placed positively charged side chains; as the number of peptide drugs entering the pharmaceutical market increases, enhancing the stability of peptide drugs will undoubtedly improve their market applicability. NADH010105 denotes effective partition energy, which belongs to the hydrophobicity category; hydrophobic interaction is the effective attraction between non-polar groups in water and plays a central role in the stability of mesoscopic assemblies and biological structures in aqueous environments. RACS820104 denotes the average relative fractional occurrence in EL(i), which belongs to the conformational attribute category of α and turn propensities. All the physicochemical properties in Table 7 can directly or indirectly quantify the secondary structure of therapeutic peptides. This finding is consistent with previous research: the physicochemical properties of amino acids are crucial in the study of protein secondary structure and help in detecting the macroscopic properties of proteins. In addition, to further study the significance of these features for peptides, we conducted experiments on the interaction effects of the key features. Since SHAP can capture the dependence effect that remains after accounting for a single feature's impact, we separately captured the interaction information of the important features for CPPs and QSPs. For CPPs, Figure 8A shows that GRAR740101 and NADH010105 exhibit a strong interaction; for QSPs, KRIW710101 and NADH010105 show a strong dependency in Figure 8B. Note that Figure 8 is not a causal model. The GRAR740101-NADH010105 scatter plot reveals an inverse trend, whereas the KRIW710101-NADH010105 scatter plot shows a positive relationship. The amino acid index cluster analysis of protein function by Nakai et al. [47] clearly shows that GRAR740101, KRIW710101 and NADH010105 are hydrophobicity-type indices, which is fully consistent with our experimental results.

Figure 8. SHAP feature dependency graphs with interactive visualization. (A) The strong interaction between GRAR740101 and NADH010105. (B) The strong dependency between KRIW710101 and NADH010105.
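For completeness, the following is a rough sketch of how such a SHAP analysis can be set up with the shap library; `model` and `X` are the hypothetical objects from the earlier sketches. DeepExplainer support for recurrent Keras models varies across shap/TensorFlow versions, and KernelExplainer over a predict wrapper is a common fallback.

```python
import numpy as np
import shap

# Background set for the explainer (a random subsample is customary).
background = X[np.random.choice(len(X), 100, replace=False)]
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(X[:200])

# Single-output models may return an array or a one-element list.
sv = shap_values[0] if isinstance(shap_values, list) else shap_values
sv = np.squeeze(sv)                          # -> (samples, 37)

# Global ranking by mean |SHAP| per feature (cf. Figure 7B and D).
importance = np.abs(sv).mean(axis=0)
print("top five features:", np.argsort(importance)[::-1][:5])

# Beeswarm summary of per-sample impacts (cf. Figure 7A and C).
shap.summary_plot(sv, np.squeeze(X[:200]))
```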
Conclusion

In this research, we proposed ITP-Pred, an interpretable model that uses an effective feature representation to predict potential therapeutic peptides, together with experimental evaluation and feature interpretability analysis. We proposed a new feature representation method that combines AAC, GAAC and numerically processed physicochemical properties (AC_P) and fed it into the CNN-BiLSTM model to automatically learn to recognize therapeutic peptides. ITP-Pred was evaluated with 5-fold cross-validation and independent validation on the CPP740 and QSP400 datasets and compared with SVM, PPTPP and LSTM. The experimental results show that ITP-Pred can capture the key information of a peptide sequence with excellent accuracy and effectiveness. We constructed three feature subsets from pairwise combinations of ITP-Pred's features to illustrate the performance and biological significance of its feature representation; the experiments showed that our full feature representation contributed the most to the model. We also conducted an interpretability analysis of the importance and physicochemical properties of ITP-Pred's AC_P feature, providing an important supplement for future peptide sequence research. We provide free and open code and data for reproduction. In future work, we will consider other peptide tasks and further extend the feature representation methods and the CNN-BiLSTM model to demonstrate the broad applicability of ITP-Pred.

Key Points

- We propose a novel method for predicting therapeutic peptides that uses protein sequences only.
- Compared with existing models, our model is more efficient with regard to computational speed.
- We interpret the features of ITP-Pred from two aspects by using SHAP.

Funding

The work was supported in part by the National Natural Science Foundation of China (61872309, 61972138, 62002111), in part by the Fundamental Research Funds for the Central Universities (531118010355), in part by the China Postdoctoral Science Foundation (2019M662770), in part by University-level Key Projects of Anhui University of Science and Technology (QN2019102) and in part by the Hunan Provincial Natural Science Foundation of China (2020JJ4215).

Lijun Cai is a professor at Hunan University. His research interests include bioinformatics and data mining.

Li Wang is a graduate student at Hunan University. Her research interest is bioinformatics.

Xiangzheng Fu is a doctoral candidate at Hunan University. His research interest is the classification of proteins in bioinformatics.

Chenxing Xia is an assistant professor at Anhui University of Science and Technology. His research interests are computer vision and deep learning.

Xiangxiang Zeng is a professor at Hunan University. His research interests include biocomputing and bioinformatics.

Quan Zou is a professor at the University of Electronic Science and Technology of China. His research interests include bioinformatics and biocomputing.

References

1. Gomes B, Augusto MT, Felício MR, et al. Designing improved active peptides for therapeutic approaches against infectious diseases. Biotechnol Adv 2018;36(2):415–29.
2. Ahrens VM, Bellmann-Sickert K, Beck-Sickinger AG. Peptides and peptide conjugates: therapeutics on the upward path. Future Med Chem 2012;4(12):1567–86.
3. Kim HO, Kahn M. A merger of rational drug design and combinatorial chemistry: development and application of peptide secondary structure mimetics. Comb Chem High Throughput Screen 2000;3(3):167–83.
4. Torres MDT, Sothiselvam S, Lu TK, et al. Peptide design principles for antimicrobial applications. J Mol Biol 2019;431(18):3547–67.
5. Wei L, Hu J, Li F, et al. Comparative analysis and prediction of quorum-sensing peptides using feature representation learning and machine learning algorithms. Brief Bioinform 2020;21(1):106–19.
6. Miller MB, Bassler BL. Quorum sensing in bacteria. Annu Rev Microbiol 2001;55:165–99.
7. Chen X, Schauder S, Potier N, et al. Structural identification of a bacterial quorum-sensing signal containing boron. Nature 2002;415(6871):545–9.
8. Wynendaele E, Bronselaer A, Nielandt J, et al. Quorumpeps database: chemical space, microbial origin and functionality of quorum sensing peptides. Nucleic Acids Res 2013;41:D655–9.
9. Hansen A, Schäfer I, Knappe D, et al. Intracellular toxicity of proline-rich antimicrobial peptides shuttled into mammalian cells by the cell-penetrating peptide penetratin. Antimicrob Agents Chemother 2012;56(10):5194–201.
10. Boisguerin P, Giorgi JM, Barrere-Lemaire S. CPP-conjugated anti-apoptotic peptides as therapeutic tools of ischemia-reperfusion injuries. Curr Pharm Des 2013;19(16):2970–8.
11. Shi NQ, Gao W, Xiang B, et al. Enhancing cellular uptake of activable cell-penetrating peptide-doxorubicin conjugate by enzymatic cleavage. Int J Nanomedicine 2012;7:1613–21.
12. Li Y, Zheng X, Cao Z, et al. Self-assembled peptide (CADY-1) improved the clinical application of doxorubicin. Int J Pharm 2012;434(1–2):209–14.
13. Lehto T, Kurrikoff K, Langel U. Cell-penetrating peptides for the delivery of nucleic acids. Expert Opin Drug Deliv 2012;9(7):823–36.
14. Margus H, Padari K, Pooga M. Cell-penetrating peptides as versatile vehicles for oligonucleotide delivery. Mol Ther 2012;20(3):525–33.
15. Presente A, Dowdy SF. PTD/CPP peptide-mediated delivery of siRNAs. Curr Pharm Des 2013;19(16):2943–7.
16. Kang Z, Ding G, Meng Z, et al. The rational design of cell-penetrating peptides for application in delivery systems. Peptides 2019;121:170149.
17. Fu X, Cai L, Zeng X, et al. StackCPPred: a stacking and pairwise energy content-based prediction of cell-penetrating peptides and their uptake efficiency. Bioinformatics 2020;36(10):3028–34.
18. Loureiro JA, Coelho MAN, Rocha S, et al. Design of potential therapeutic peptides and carriers to inhibit amyloid β peptide aggregation. In: 2012 IEEE 2nd Portuguese Meeting in Bioengineering (ENBENG), 2012.
19. Ramaker K, Henkel M, Krause T, et al. Cell penetrating peptides: a comparative transport analysis for 474 sequence motifs. Drug Deliv 2018;25(1):928–37.
20. Schmidt S, Adjobo-Hermans MJ, Kohze R, et al. Identification of short hydrophobic cell-penetrating peptides for cytosolic peptide delivery by rational design. Bioconjug Chem 2017;28(2):382–9.
21. Wynendaele E, Gevaert B, Stalmans S, et al. Exploring the chemical space of quorum sensing peptides. Biopolymers 2015;104(5):544–51.
22. Attique M, Manavalan B, Shin TH, et al. Prediction of therapeutic peptides using machine learning: computational models, datasets, and feature encodings. IEEE Access 2020;8:148570–98.
23. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215(3):403–10.
24. Akbar S, Hayat M, Iqbal M, et al. iACP-GAEnsC: evolutionary genetic algorithm based ensemble classification of anticancer peptides by utilizing hybrid feature space. Artif Intell Med 2017;79:62–70.
25. Altschul SF, Madden TL, Schäffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25(17):3389–402.
26. Wei L, Zhou C, Chen H, et al. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics 2018;34(23):4007–16.
27. Wu C, Gao R, Zhang Y, et al. PTPD: predicting therapeutic peptides by deep learning and word2vec. BMC Bioinformatics 2019;20(1):456.
28. Liu B, Liu F, Wang X, et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res 2015;43(W1):W65–71.
29. Saha S, Raghava G. Prediction of neurotoxins based on their function and source. In Silico Biol 2007;7(4–5):369–87.
30. Yi HC, You ZH, Zhou X, et al. ACP-DL: a deep learning long short-term memory model to predict anticancer peptides using high-efficiency feature representation. Mol Ther Nucleic Acids 2019;17:1–9.
31. Zhang Y, Zou Q. PPTPP: a novel therapeutic peptide prediction method using physicochemical property encoding and adaptive feature representation learning. Bioinformatics 2020;36(13):3982–7.
32. Ettayapuram Ramaprasad AS, Singh S, Raghava GPS, et al. AntiAngioPred: a server for prediction of anti-angiogenic peptides. PLoS One 2015;10(9):e0136990.
33. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22(13):1658–9.
34. UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 2019;47(D1):D506–15.
35. Wei L, Xing PW, Su R, et al. CPPred-RF: a sequence-based predictor for identifying cell-penetrating peptides and their uptake efficiency. J Proteome Res 2017;16(5):2044–53.
36. Bhasin M, Raghava GP. Classification of nuclear receptors based on amino acid composition and dipeptide composition. J Biol Chem 2004;279(22):23262–6.
37. Chen Z, Zhao P, Li F, et al. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018;34(14):2499–502.
38. Afridi TH, Khan A, Lee YS. Mito-GSAAC: mitochondria prediction using genetic ensemble classifier and split amino acid composition. Amino Acids 2012;42(4):1443–54.
39. Fu H, Niu Z, Zhang C, et al. Visual cortex inspired CNN model for feature construction in text analysis. Front Comput Neurosci 2016;10:64.
40. Jones DT, Kandathil SM. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 2018;34(19):3308–15.
41. Gers FA, Schmidhuber J, Cummins F. Learning to forget: continual prediction with LSTM. Neural Comput 2000;12(10):2451–71.
42. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9(8):1735–80.
43. Grisoni F, Moret M, Lingwood R, et al. Bidirectional molecule generation with recurrent neural networks. J Chem Inf Model 2020;60(3):1175–83.
44. Palau J, Argos P, Puigdomenech P. Protein secondary structure. Studies on the limits of prediction accuracy. Int J Pept Protein Res 1982;19(4):394–401.
45. Charoenkwan P, Schaduangrat N, Nantasenamat C, et al. iQSP: a sequence-based tool for the prediction and analysis of quorum sensing peptides via Chou's 5-steps rule and informative physicochemical properties. Int J Mol Sci 2020;21(1):75.
46. Errington N, Doig AJ. A phosphoserine-lysine salt bridge within an alpha-helical peptide, the strongest alpha-helix side-chain interaction measured to date. Biochemistry 2005;44(20):7553–8.
47. Nakai K, Kidera A, Kanehisa M. Cluster analysis of amino acid indices for prediction of protein structure and function. Protein Eng 1988;2(2):93–100.

© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please email: journals.permissions@oup.com