Datasets Downloads
The substrate data were derived mainly from the MEROPS database, an online information resource for proteases and their inhibitors (Rawlings et al., Nucleic Acids Res 2008, 36, D320-D325). To avoid over-training, sequence homology reduction was applied within the training and testing datasets so that the sequence identity between any two peptide sequences does not exceed 70%.
We extracted 38 substrate datasets, each of which is protease-specific and species-specific. The following table summarizes the statistics of the substrate datasets we used to develop the iProt-Sub tool for predicting cleavage sites of multiple proteases. Each substrate dataset can be downloaded by clicking the hyperlink associated with the MEROPS ID of the protease in this table.
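As an illustration of the 70% identity threshold, the following is a minimal sketch of a greedy redundancy filter. It is not the procedure used to build these datasets (the source does not name the tool); the function names, the ungapped position-by-position identity measure, and the example peptides are all assumptions made for this example.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions with identical residues (ungapped comparison).

    Assumption: both peptides have the same length, e.g. P4-P4' windows.
    """
    if len(a) != len(b):
        raise ValueError("this simple measure only compares equal-length peptides")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def reduce_redundancy(peptides, max_identity=70.0):
    """Keep a peptide only if it is at most `max_identity` percent identical
    to every peptide already kept (greedy, order-dependent)."""
    kept = []
    for pep in peptides:
        if all(percent_identity(pep, k) <= max_identity for k in kept):
            kept.append(pep)
    return kept


# Hypothetical 8-residue peptides, not taken from the datasets.
peptides = ["DEVDGSAK", "DEVDGSAR", "LEHDAKLM", "DEVDASAK"]
print(reduce_redundancy(peptides))  # ['DEVDGSAK', 'LEHDAKLM']
```

The second and fourth peptides share 7 of 8 residues (87.5%) with the first and are therefore dropped. A production pipeline would normally use an alignment-based clustering tool rather than this position-wise comparison.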
| Protease name | MEROPS ID | Number of substrates | Number of cleavage sites | Species | Family |
|---|---|---|---|---|---|
| Cathepsin D | A01.009 | 23 | 59 | Homo sapiens | Aspartic |
| Cathepsin D | A01.009 | 342 | 579 | Mus musculus | Aspartic |
| Cathepsin E | A01.010 | 655 | 1216 | Mus musculus | Aspartic |
| Cathepsin L | C01.032 | 17 | 63 | Homo sapiens | Cysteine |
| Calpain-1 | C02.001 | 30 | 61 | Homo sapiens | Cysteine |
| Calpain-2 | C02.002 | 17 | 66 | Homo sapiens | Cysteine |
| Caspase-1 | C14.001 | 47 | 53 | Mus musculus | Cysteine |
| Caspase-3 | C14.003 | 251 | 373 | Homo sapiens | Cysteine |
| Caspase-7 | C14.004 | 48 | 64 | Homo sapiens | Cysteine |
| Caspase-6 | C14.005 | 58 | 165 | Homo sapiens | Cysteine |
| Caspase-8 | C14.009 | 37 | 56 | Homo sapiens | Cysteine |
| Matrix Metallopeptidase-1 | M10.001 | 21 | 52 | Homo sapiens | Metallo |
| Matrix Metallopeptidase-8 | M10.002 | 23 | 85 | Homo sapiens | Metallo |
| Matrix Metallopeptidase-2 | M10.003 | 35 | 115 | Homo sapiens | Metallo |
| Matrix Metallopeptidase-9 | M10.004 | 43 | 290 | Homo sapiens | Metallo |
| Matrix Metallopeptidase-3 | M10.005 | 44 | 132 | Homo sapiens | Metallo |
| Matrix Metallopeptidase-7 | M10.008 | 42 | 142 | Homo sapiens | Metallo |
| Matrix Metallopeptidase-12 | M10.009 | 23 | 178 | Homo sapiens | Metallo |
| Matrix Metallopeptidase-13 | M10.013 | 23 | 90 | Homo sapiens | Metallo |
| Membrane-type Matrix Metallopeptidase-12 | M10.014 | 36 | 92 | Homo sapiens | Metallo |
| ADAMTS4 Peptidase | M12.221 | 13 | 50 | Homo sapiens | Metallo |
| Neprilysin | M13.001 | 19 | 67 | Homo sapiens | Metallo |
| Insulysin | M16.002 | 6 | 50 | Homo sapiens | Metallo |
| Granzyme B | S01.010 | 410 | 515 | Homo sapiens | Serine |
| Granzyme B | S01.010 | 77 | 88 | Mus musculus | Serine |
| Kallikrein-related Peptidase 5 | S01.017 | 31 | 59 | Homo sapiens | Serine |
| Elastase-2 | S01.131 | 45 | 133 | Homo sapiens | Serine |
| Granzyme A | S01.135 | 44 | 57 | Homo sapiens | Serine |
| Granzyme B | S01.136 | 143 | 157 | Homo sapiens | Serine |
| Granzyme B | S01.136 | 168 | 201 | Mus musculus | Serine |
| Granzyme M | S01.139 | 491 | 707 | Homo sapiens | Serine |
| Plasmin | S01.233 | 42 | 89 | Homo sapiens | Serine |
| Kallikrein-related Peptidase 4 | S01.251 | 78 | 80 | Homo sapiens | Serine |
| Furin | S08.071 | 56 | 75 | Homo sapiens | Serine |
| PCSK2 Peptidase | S08.073 | 21 | 68 | Mus musculus | Serine |
| KPC2-type Peptidase | S08.109 | 34 | 115 | Caenorhabditis elegans | Serine |
| Signal Peptidase I | S26.001 | 141 | 141 | Escherichia coli | Serine |
| Signal Peptidase I | S26.001 | 54 | 54 | Salmonella typhimurium | Serine |
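
The family of each dataset can be read from the first letter of its MEROPS ID (A, C, M or S), so per-family totals can be computed directly from the table rows. The sketch below shows one way to do that. The function name `totals_by_family` is hypothetical, and `ROWS` holds only five rows copied from the table above to keep the example short.

```python
from collections import defaultdict

# First letter of the MEROPS ID -> family name.
FAMILY_BY_PREFIX = {"A": "Aspartic", "C": "Cysteine", "M": "Metallo", "S": "Serine"}

# A few rows copied from the table above (name, MEROPS ID, substrates, sites, species).
ROWS = """\
| Cathepsin D | A01.009 | 23 | 59 | Homo sapiens |
| Cathepsin L | C01.032 | 17 | 63 | Homo sapiens |
| Calpain-1 | C02.001 | 30 | 61 | Homo sapiens |
| Insulysin | M16.002 | 6 | 50 | Homo sapiens |
| Furin | S08.071 | 56 | 75 | Homo sapiens |
"""


def totals_by_family(table_text):
    """Sum substrates and cleavage sites per family from pipe-delimited rows."""
    totals = defaultdict(lambda: [0, 0])
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip header, separator, and malformed lines.
        if len(cells) < 4 or not cells[2].isdigit() or not cells[3].isdigit():
            continue
        family = FAMILY_BY_PREFIX[cells[1][0]]
        totals[family][0] += int(cells[2])
        totals[family][1] += int(cells[3])
    return {family: tuple(counts) for family, counts in totals.items()}


print(totals_by_family(ROWS))
# {'Aspartic': (23, 59), 'Cysteine': (47, 124), 'Metallo': (6, 50), 'Serine': (56, 75)}
```

Pasting all 38 rows into `ROWS` gives the full per-family breakdown in the same way.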
Supplementary material downloads
1. The compiled substrate datasets cover 38 different protease types from four major protease families: Aspartic (A), Cysteine (C), Metallo (M) and Serine (S). After sequence homology reduction, the final datasets contain 3688 substrate sequences and 6637 cleavage sites. The curated substrate dataset for each protease can be downloaded by clicking its MEROPS ID in the table above. Alternatively, you can download the whole substrate dataset for all thirty-eight proteases at this link: Substrate_seq.tar.gz. The proteome-wide scan results for seven proteases can be downloaded at this link: Proteome_wide_scan_results
For each entry (starting with ">") of a substrate:
- The first line gives the UniProt ID, followed by the MEROPS ID of the corresponding protease that can cleave the substrate. These two annotations are separated by "|";
- The second line, starting with "site:", gives the substrate cleavage site spanning the P4 to P4' positions; "|" indicates the cleavage site. Note that a substrate might have one or more experimentally verified cleavage sites;
- The third part is the substrate sequence in FASTA format, which follows the cleavage sites;
- The fourth part denotes the secondary structure predicted by the PSIPRED program (Jones, 1999). "H" denotes alpha-helix, "E" denotes beta-strand, while "C" denotes coils or loops;
- The fifth part denotes the solvent accessibility predicted by the NetSurfP program (Petersen et al., 2009). "e" denotes exposed, while "b" denotes buried;
- The last part denotes the natively unstructured or disordered regions predicted by the DISOPRED2 program (Ward et al., 2004). "*" denotes disordered, while "." denotes structured or ordered.
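
The entry layout described above can be read with a small parser. The following is a minimal sketch under several assumptions that the description does not confirm: each cleavage site is on its own "site:" line, and the sequence, secondary structure, solvent accessibility and disorder annotations each occupy exactly one line, in that order. The class and function names, the accession `P00000` and all strings in `SAMPLE` are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SubstrateEntry:
    uniprot_id: str
    merops_id: str
    sites: list = field(default_factory=list)  # (P4..P1, P1'..P4') pairs
    sequence: str = ""
    secondary_structure: str = ""    # H / E / C
    solvent_accessibility: str = ""  # e / b
    disorder: str = ""               # * / .


def _complete(entry, tracks):
    """Attach the four per-residue lines to the entry (assumed order)."""
    if len(tracks) != 4:
        raise ValueError(f"{entry.uniprot_id}: expected 4 lines, got {len(tracks)}")
    (entry.sequence, entry.secondary_structure,
     entry.solvent_accessibility, entry.disorder) = tracks
    return entry


def parse_entries(text):
    entries, entry, tracks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if entry is not None:
                entries.append(_complete(entry, tracks))
            uniprot_id, merops_id = (p.strip() for p in line[1:].split("|", 1))
            entry, tracks = SubstrateEntry(uniprot_id, merops_id), []
        elif entry is None:
            raise ValueError("data found before the first '>' header line")
        elif line.startswith("site:"):
            left, right = line[len("site:"):].strip().split("|", 1)
            entry.sites.append((left, right))
        else:
            tracks.append(line)
    if entry is not None:
        entries.append(_complete(entry, tracks))
    return entries


# Hypothetical entry, not copied from the real files.
SAMPLE = """\
>P00000|C14.003
site:DEVD|GSAK
MDEVDGSAKLL
CCCCHHHHCCC
eeeebbbeeee
***........
"""

entry = parse_entries(SAMPLE)[0]
print(entry.uniprot_id, entry.merops_id, entry.sites)
```

If the downloaded files wrap sequences over several lines or list multiple sites on a single line, the track handling and the "site:" branch would need to be adapted accordingly.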



