iProt-Sub
Protease Specificity Prediction Server

Dataset Downloads

The substrate data were derived mainly from the MEROPS database, an online information resource for proteases and their inhibitors (Rawlings et al., Nucleic Acids Res 2008, 36, D320-D325). To avoid over-training, sequence homology reduction was applied within the training and testing datasets so that the sequence identity between any two peptide sequences does not exceed 70%.
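The 70% identity cutoff can be illustrated with a minimal sketch. The page does not say which tool or identity definition was used, so everything here is an assumption for illustration: equal-length peptides, identity as the fraction of matching positions, and a greedy keep-first filter. The function names and example peptides are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length peptides."""
    if len(a) != len(b) or not a:
        raise ValueError("peptides must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def reduce_redundancy(peptides, max_identity=0.70):
    """Greedily keep a peptide only if its identity to every
    previously kept peptide is at most max_identity."""
    kept = []
    for pep in peptides:
        if all(identity(pep, k) <= max_identity for k in kept):
            kept.append(pep)
    return kept


# Made-up octapeptides, not taken from the iProt-Sub datasets.
peptides = ["DEVDGSAA", "DEVDGSAT", "DEVDASLK", "LRRASLGK"]
print(reduce_redundancy(peptides))
# "DEVDGSAT" is dropped: it matches "DEVDGSAA" at 7 of 8 positions (87.5%).
```

A greedy filter depends on input order; dedicated clustering tools handle unequal lengths and alignment, which this sketch does not attempt.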

We extracted 38 substrate datasets, each of which is protease-specific and species-specific. The following table summarizes the substrate datasets used to develop the iProt-Sub tool for predicting cleavage sites of multiple proteases. Each substrate dataset can be downloaded by clicking the MEROPS ID hyperlink of the corresponding protease in the table.

Table 1: The list of the proteases covered by iProt-Sub
Protease name | MEROPS ID | Number of substrates | Number of cleavage sites | Species | Family
Cathepsin D | A01.009 | 23 | 59 | Homo sapiens | Aspartic
Cathepsin D | A01.009 | 342 | 579 | Mus musculus | Aspartic
Cathepsin E | A01.010 | 655 | 1216 | Mus musculus | Aspartic
Cathepsin L | C01.032 | 17 | 63 | Homo sapiens | Cysteine
Calpain-1 | C02.001 | 30 | 61 | Homo sapiens | Cysteine
Calpain-2 | C02.002 | 17 | 66 | Homo sapiens | Cysteine
Caspase-1 | C14.001 | 47 | 53 | Mus musculus | Cysteine
Caspase-3 | C14.003 | 251 | 373 | Homo sapiens | Cysteine
Caspase-7 | C14.004 | 48 | 64 | Homo sapiens | Cysteine
Caspase-6 | C14.005 | 58 | 165 | Homo sapiens | Cysteine
Caspase-8 | C14.009 | 37 | 56 | Homo sapiens | Cysteine
Matrix Metallopeptidase-1 | M10.001 | 21 | 52 | Homo sapiens | Metallo
Matrix Metallopeptidase-8 | M10.002 | 23 | 85 | Homo sapiens | Metallo
Matrix Metallopeptidase-2 | M10.003 | 35 | 115 | Homo sapiens | Metallo
Matrix Metallopeptidase-9 | M10.004 | 43 | 290 | Homo sapiens | Metallo
Matrix Metallopeptidase-3 | M10.005 | 44 | 132 | Homo sapiens | Metallo
Matrix Metallopeptidase-7 | M10.008 | 42 | 142 | Homo sapiens | Metallo
Matrix Metallopeptidase-12 | M10.009 | 23 | 178 | Homo sapiens | Metallo
Matrix Metallopeptidase-13 | M10.013 | 23 | 90 | Homo sapiens | Metallo
Membrane-type Matrix Metallopeptidase-12 | M10.014 | 36 | 92 | Homo sapiens | Metallo
ADAMTS4 Peptidase | M12.221 | 13 | 50 | Homo sapiens | Metallo
Neprilysin | M13.001 | 19 | 67 | Homo sapiens | Metallo
Insulysin | M16.002 | 6 | 50 | Homo sapiens | Metallo
Granzyme B | S01.010 | 410 | 515 | Homo sapiens | Serine
Granzyme B | S01.010 | 77 | 88 | Mus musculus | Serine
Kallikrein-related Peptidase 5 | S01.017 | 31 | 59 | Homo sapiens | Serine
Elastase-2 | S01.131 | 45 | 133 | Homo sapiens | Serine
Granzyme A | S01.135 | 44 | 57 | Homo sapiens | Serine
Granzyme B | S01.136 | 143 | 157 | Homo sapiens | Serine
Granzyme B | S01.136 | 168 | 201 | Mus musculus | Serine
Granzyme M | S01.139 | 491 | 707 | Homo sapiens | Serine
Plasmin | S01.233 | 42 | 89 | Homo sapiens | Serine
Kallikrein-related Peptidase 4 | S01.251 | 78 | 80 | Homo sapiens | Serine
Furin | S08.071 | 56 | 75 | Homo sapiens | Serine
PCSK2 Peptidase | S08.073 | 21 | 68 | Mus musculus | Serine
KPC2-type Peptidase | S08.109 | 34 | 115 | Caenorhabditis elegans | Serine
Signal Peptidase I | S26.001 | 141 | 141 | Escherichia coli | Serine
Signal Peptidase I | S26.001 | 54 | 54 | Salmonella typhimurium | Serine

 

Supplementary material downloads

1. The compiled substrate datasets cover 38 different protease types from four major protease families: Aspartic (A), Cysteine (C), Metallo (M) and Serine (S). After sequence homology reduction, the final datasets contain 3688 substrate sequences and 6637 cleavage sites. The curated substrate dataset of each protease can be downloaded individually by clicking its MEROPS ID in the table above. Alternatively, you can download the complete substrate dataset for all thirty-eight proteases at this link: Substrate_seq.tar.gz. The proteome-wide scan results for seven proteases can be downloaded at this link: Proteome_wide_scan_results
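Because the first letter of a MEROPS ID encodes the family letter used above, datasets can be grouped by family directly from their identifiers. The following is a small sketch; the mapping covers only the four families named on this page, and `family_of` is a hypothetical helper name.

```python
from collections import Counter

# Family letters as listed on this page (A, C, M, S only).
FAMILY_BY_PREFIX = {
    "A": "Aspartic",
    "C": "Cysteine",
    "M": "Metallo",
    "S": "Serine",
}


def family_of(merops_id: str) -> str:
    """Return the family name for a MEROPS ID such as 'C14.003'."""
    return FAMILY_BY_PREFIX[merops_id.strip()[0].upper()]


# A few MEROPS IDs taken from Table 1.
ids = ["A01.009", "C14.003", "C14.004", "M10.004", "S08.071"]
counts = Counter(family_of(i) for i in ids)
print(counts["Cysteine"])  # 2
```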

For each entry (starting with ">") of a substrate:

The first line gives the UniProt ID, followed by the MEROPS ID of the corresponding protease that can cleave the substrate. These two annotations are separated by "|";

The second line, starting with "site:", gives the substrate cleavage site from P4 to P4', where "|" marks the cleavage position. Note that a substrate may have one or more experimentally verified cleavage sites;

The third part is the substrate sequence in FASTA format, following the cleavage sites;

The fourth part gives the secondary structure predicted by the PSIPRED program (Jones, 1999). "H" denotes alpha-helix, "E" denotes beta-strand, and "C" denotes coils or loops;

The fifth part gives the solvent accessibility predicted by the NetSurfP program (Petersen et al., 2009). "e" denotes exposed, and "b" denotes buried;

The last part gives the natively unstructured or disordered regions predicted by the DISOPRED2 program (Ward et al., 2004). "*" denotes disordered, and "." denotes structured or ordered.
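The entry layout described above can be read with a short parser. This is only a sketch under stated assumptions, since the exact file layout is not shown on this page: each part is assumed to occupy a single line, every cleavage site is assumed to have its own "site:" line, and the sample entry (including the accession `P00000`) is a made-up placeholder. `SubstrateEntry`, `parse_entry` and `cleavage_positions` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SubstrateEntry:
    uniprot_id: str
    merops_id: str
    sites: list = field(default_factory=list)  # (P4..P1, P1'..P4') pairs
    sequence: str = ""
    secondary_structure: str = ""  # H / E / C per residue
    accessibility: str = ""        # e / b per residue
    disorder: str = ""             # * / . per residue


def parse_entry(text: str) -> SubstrateEntry:
    """Parse one entry, assuming one line per part (an assumption)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("entry must start with '>'")
    uniprot_id, merops_id = lines[0][1:].split("|", 1)
    entry = SubstrateEntry(uniprot_id.strip(), merops_id.strip())
    rest = []
    for ln in lines[1:]:
        if ln.startswith("site:"):
            # "|" separates the non-primed side from the primed side.
            left, right = ln[len("site:"):].strip().split("|", 1)
            entry.sites.append((left, right))
        else:
            rest.append(ln)
    if len(rest) != 4:
        raise ValueError("expected sequence plus three annotation lines")
    (entry.sequence, entry.secondary_structure,
     entry.accessibility, entry.disorder) = rest
    return entry


def cleavage_positions(entry: SubstrateEntry) -> list:
    """Number of residues before each scissile bond; -1 if not found."""
    positions = []
    for left, right in entry.sites:
        start = entry.sequence.find(left + right)
        positions.append(start + len(left) if start >= 0 else -1)
    return positions


# Hypothetical entry written to match the description, not real data.
sample = """
>P00000|C14.003
site:DEVD|GSAA
MKDEVDGSAAL
CCCHHHHCCCC
eeeebbeeeee
***.....***
"""
entry = parse_entry(sample)
print(entry.merops_id, entry.sites, cleavage_positions(entry))
```

If the downloaded files wrap sequences over several lines or list several sites on one "site:" line, the line handling above would need to be adapted accordingly.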