SWISS-PROT RELEASE 28.0 RELEASE NOTES
1. INTRODUCTION
1.1 Evolution
Release 28.0 of SWISS-PROT contains 36000 sequence entries, comprising
12'496'420 amino acids abstracted from 33903 references. This represents
an increase of 10.1% over release 27. The recent growth of the data bank
is summarized below.
Release Date Number of entries Nb of amino acids
3.0 11/86 4160 969 641
4.0 04/87 4387 1 036 010
5.0 09/87 5205 1 327 683
6.0 01/88 6102 1 653 982
7.0 04/88 6821 1 885 771
8.0 08/88 7724 2 224 465
9.0 11/88 8702 2 498 140
10.0 03/89 10008 2 952 613
11.0 07/89 10856 3 265 966
12.0 10/89 12305 3 797 482
13.0 01/90 13837 4 347 336
14.0 04/90 15409 4 914 264
15.0 08/90 16941 5 486 399
16.0 11/90 18364 5 986 949
17.0 02/91 20024 6 524 504
18.0 05/91 20772 6 792 034
19.0 08/91 21795 7 173 785
20.0 11/91 22654 7 500 130
21.0 03/92 23742 7 866 596
22.0 05/92 25044 8 375 696
23.0 08/92 26706 9 011 391
24.0 12/92 28154 9 545 427
25.0 04/93 29955 10 214 020
26.0 07/93 31808 10 875 091
27.0 10/93 33329 11 484 420
28.0 02/94 36000 12 496 420
1.2 Source of data
Release 28.0 has been updated using protein sequence data from release
38.0 of the PIR (Protein Identification Resource) protein data bank, as
well as translation of nucleotide sequence data from release 37.0 of the
EMBL Nucleotide Sequence Database.
As an indication to the source of the sequence data in the SWISS-PROT
data bank we list here the statistics concerning the DR (Database cross-
references) pointer lines:
Entries with pointer(s) to only PIR entri(es): 4624
Entries with pointer(s) to only EMBL entri(es): 5593
Entries with pointer(s) to both EMBL and PIR entri(es): 25048
Entries with no pointers lines: 735
<PAGE>
2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 27
2.1 Sequences and annotations
About 2700 sequences have been added since release 27, the sequence data
of 379 existing entries has been updated and the annotations of 5700
entries have been revised.
In particular we have started a process to 'clean-up' the various
representations of domains in the feature lines (especially the usage of
the feature keys "DOMAIN", "REPEAT", "DNA_BIND", and "SITE"). We also
have undertook an overall revision of the CC topics "SUBCELLULAR
LOCATION", "SUBUNIT and "CAUTION". Most of the work has already been
carried out for this release and we plan to finish this major annotation
revamping for the next release.
2.2 What's happening with the model organisms
As we announced in the last release we have selected a number of
organisms that are the target of genome sequencing and/or mapping
projects and for which we intend to:
- Be as complete as possible. All sequences available at a given time
should be immediately included in SWISS-PROT. This also includes
sequence corrections and updates.
- Provide a high level of annotations.
- Cross-references to specialized database(s) that contain, among other
data, some genetic information about the genes that code for these
proteins.
- Provide specific indices or documents.
Thanks to a collaborative effort with Douglas Smith and Bill Loomis of
UCSD we have added a fifth organism, Dictyostelium discoideum (slime
mold), to our list of model organisms. Many new sequences were added at
this release and a new document file (DICTY.TXT) lists all the
D.discoideum sequence entries in SWISS-PROT and their corresponding gene
names.
At this release we also have started our collaboration with the group at
the Sanger Genome Center in Hinxton (UK) and we have added 516 new
C.elegans sequences; most of which are translation of sequencing data
from the genome project. A new document file (CELEGANS.TXT) list all the
C.elegans sequence entries in SWISS-PROT and their corresponding gene
names and, when appropriate, their cosmid-derived names.
Here is the current status of the five model organisms:
Organism Database Index file Number of
cross-referenced sequences
-------------- ---------------------- -------------- ---------
C.elegans WormPep CELEGANS.TXT 672
D.discoideum DictyDB DICTY.TXT 183
D.melanogaster FlyBase In preparation 600
E.coli EcoGene ECOLI.TXT 2555
S.cerevisiae LISTA (in preparation) YEAST.TXT 1731
<PAGE>
2.3 The Expasy World-Wide Web server
2.3.1 Background information
The World-Wide Web (WWW), which originated at CERN, is a powerful global
information system merging networked information retrieval and
hypertext. It gives access, using hypertext links, to the documents and
information contained in all the existing WWW servers around the world,
as well as to the data obtainable through other information retrieval
systems like WAIS, Gopher, X500, etc. To access a WWW server, one has to
run on a local computer a client program (a WWW browser), which displays
hypertext documents. The user can then either request a keyword search
or jump to another document by following a hypertext link. WWW has the
outstanding advantage of extending the hypertext model to the whole
world (by allowing hypertext jumps to documents anywhere on the internet
network) and by being device and user-interface independent (browsers
exist for a variety of computers and user-interfaces, including Unix
workstations running XWindows, MacIntoshes and PCs with Microsoft
Windows).
The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT and SWISS-2DPAGE databases and, through any
SWISS-PROT protein sequence entry, to other databases such as EMBL,
PROSITE, REBASE, FlyBase, PDB, OMIM and Medline. Using a browser which
is able to display images one can also remotely access 2D gels image
data from SWISS-2DPAGE.
A WWW server can be accessed on the internet through its Uniform
Resource Locator (URL), the addressing system defined by the WWW model.
The URL for the ExPASy WWW server is:
http://expasy.hcuge.ch/
or
http://129.195.254.61/
To access a WWW server, you need to run a browser (or client) program on
your local computer. Browsers exist for a variety of machines and may be
obtained by anonymous ftp. ExPASy can be used with any WWW browser, but
we recommend NCSA Mosaic. It is a very flexible and powerful browser
with a graphical user interface; available for Unix boxes using
X11/Motif; for Apple McIntoshes and for Microsoft Windows. You can get
it from the FTP site: ftp.ncsa.uiuc.edu.
To access all the data available from SWISS-2DPAGE, the user's local
computer needs to run an image viewing program. For most browsers on
Unix workstations the default program is xv, a shareware application.
Similar Windows or Apple shareware or public domain applications are
also available.
For more information on the ExPASy WWW server, you can read the
following article:
Appel R.D., Sanchez J.-C., Bairoch A., Golaz O., Miu M., Pasquali C.,
Reynaldo Vargas J., Hughes G.J., Hochstrasser D.F.
Electrophoresis 14:1232-1238(1993).
<PAGE>
Or you can contact Dr. Ron Appel:
Email: appel@cih.hcuge.ch
Fax: +41-22-372 61 98
2.3.2 Changes to the WWW ExPASy server
There has been quite a number of changes to the server in the last three
months. We want to list specifically the following enhancements:
- It is now possible to retrieve the Medline abstract of any reference
in SWISS-PROT.
- Full text searches of SWISS-PROT have been implemented.
- The data available on the server includes the latest full release of
SWISS-PROT as well as the cumulative weekly updates.
- Most SWISS-PROT documents such as the new indices for the model
organisms, are available as hypertext documents.
- The SWISS-2DPAGE part of the server has been greatly enhanced with
new functionalities.
2.4 Changes in the DR line
We have added cross-references to the Dictyostelium discoideum genome
database (DictyDB) (see section 2.2 of these notes). These cross-
references are present in the DR lines:
Data bank identifier: DICTYDB
Primary identifier: Unique identifier attributed by DictyDB to the
gene coding for the protein.
Secondary identifier: The gene designation (name). A "-" is present
when no gene name has yet been assigned.
Example: DR DICTYDB; DD01047; MYOA.
2.5 Weekly updates of SWISS-PROT
Since release 24, we provide weekly updates of SWISS-PROT.
[Note: due to the fact that we were in the process of 'cleaning up' the
annotations of many entries (see section 2.1), we temporarily stopped
providing weekly updates from December 1993 to February 1994.]
The weekly updates are available by anonymous FTP. Three files are
updated at each update:
new_seq.dat Contains all the new entries since the last full release.
upd_seq.dat Contains the entries for which the sequence data has been
updated since the last release.
upd_ann.dat Contains the entries for which one or more annotation
fields have been updated since the last release.
Currently these files are available on the following anonymous ftp
servers:
<PAGE>
Organism ExPASy (Geneva University Expert Protein Analysis System)
Address expasy.hcuge.ch (or 129.195.254.61)
Directory /databases/swiss-prot/updates
Organism National Center for Biotechnology Information (NCBI)
Address ncbi.nlm.nih.gov (or 130.14.20.1)
Directory /repository/swiss-prot/updates
Organism EMBL ftp server
Address ftp.embl-heidelberg.de (or 192.54.41.33)
Directory /pub/databases/swissprot/new
!! Important notes !!!
Although we try to follow a regular schedule, we do not promise to
update these files every week. In some cases two weeks will elapse in-
between two updates.
Due to the current mechanism used to build a release the entries that
are provided in these updates are not guaranteed to be error free. Also,
for the same reason, new entries do not contain an OC (Organism
Classification) line.
3. ENZYME AND PROSITE
3.1 The ENZYME data bank
Release 15.0 of the ENZYME data bank is distributed with release 28 of
SWISS-PROT. ENZYME release 15.0 contains information relative to 3489
enzymes.
3.2 The PROSITE data bank
3.2.1 Release 11.1
Release 11.1 of the PROSITE data bank is distributed with release 28 of
SWISS-PROT. Release 11.1 contains 715 documentation chapters that
describes 926 different patterns. Release 11.1 does not really represent
a new release; the only changes between releases 11.0 and 11.1 are
updating of the pointers to the SWISS-PROT entries whose name have been
modified between releases 27 and 28. The next release of PROSITE (12.0)
will be distributed with release 29 of SWISS-PROT.
3.2.2 Future developments
Starting with the next major releases (12.0 of June 1994), PROSITE will
be extended to include weight matrices (also known as profiles). There
are a number of protein families as well as functional or structural
domains that cannot be detected using patterns due to their extreme
sequence divergence. Typical examples of important functional domains
which are weakly conserved are the immunoglobulin domains, the SH2 and
<PAGE>
SH3 domains, or the fibronectin type III domain. In such domains there
are only a few sequence positions which are well conserved. Any attempt
of building a consensus pattern for such regions will either fail to
pick up a significant proportion of the protein sequences that contain
such region (false negative) or will pick up too many proteins that do
not contain the region (false positive). The use of technique based on
weight matrices or profiles allows the detection of such proteins or
domains. Dr. Philipp Bucher at ISREC in Lausanne and myself are
collaborating to include such methods into PROSITE. This collaboration
also includes other participants such as Roland Luethy (AMGEN), Michael
Gribskov (SDSC) and Steve Altschul (NCBI). If you are interested in
participating in this project please contact Philipp Bucher at:
pbucher@isrec-sun1.unil.ch
Important notice for software developers: the integration of profiles
into PROSITE will not "break" the current format. The profiles entries
in the PROFILE.DAT file will be tagged with the token "MATRIX" on the
"ID" line (currently, only "PATTERN" and "RULE" are used as tokens); a
new line-type "MA" will be used in these entries to store all the weight
matrices specific parameters. The format of the PROSITE.DOC file will
not be changed.
The full description of the format of the PROSITE profile extensions
will be available in a couple of weeks as a user's manual (file name:
PROFILE.TXT) that will be posted on the ExPASy and NCBI file servers.
Organism ExPASy (Geneva University Expert Protein Analysis System)
Address expasy.hcuge.ch (or 129.195.254.61)
Directory /databases/prosite
Organism National Center for Biotechnology Information (NCBI)
Address ncbi.nlm.nih.gov (or 130.14.20.1)
Directory /repository/prosite
Here is an example of a PROSITE profile:
ID SH3; MATRIX.
AC PS90001;
DT JUN-1994 (CREATED); JUN-1994 (DATA UPDATE); JUN-1994 (INFO UPDATE).
DE SH3 domain profile.
CC /TAXO-RANGE=??E??; /MAX-REPEAT=2;
MA /GENERAL_SPEC: ALPHABET='ACDEFGHIKLMNPQRSTVWY';
MA /DISJOINT: DEFINITION=PROTECT; N1=1; N2=53.
MA /NORMALIZATION: MODE=1; FUNCTION=GRIBSKOV; R1=2.97; R2=-0.0035;
MA R3=0.7386; R4=-1.001; R5=0.208; TEXT='ZScore';
MA /NORMALIZATION: MODE=2; FUNCTION=LINEAR; R1=0.0; R2=100.0;
MA TEXT='OrigScore';
MA /CUT_OFF: LEVEL=0; SCORE=600; N_SCORE=7.0; MODE=1;
MA /DEFAULT: MI=-26; I=-3; IM=0; MD=-26; D=-3; DM=0;
<PAGE>
MA /M: SY='F';M=-2,-3,-3,-4,2,-3,-2,1,-2,0,-1,-2,-3,-3,-4,-2,-1,0,-5,2;
MA /M: SY='I';M=-1,-5,-2,-3,-2,-3,0,1,1,-1,1,-1,-2,-1,1,-1,0,1,-4,-4;
MA /M: SY='A';M=2,-3,1,0,-5,2,-2,-1,-1,-3,-2,1,1,0,-2,2,2,0,-8,-5;
MA /M: SY='L';M=-3,-8,-5,-4,2,-6,-2,2,-4,6,4,-3,-3,-2,-3,-3,-2,1,-3,0;
MA /M: SY='Y';M=-4,-2,-6,-6,9,-7,0,-1,-5,-1,-3,-3,-6,-5,-6,-4,-4,-4,-1,11;
MA /M: SY='D';M=1,-6,3,3,-7,0,0,-2,-1,-4,-3,2,0,1,-2,0,0,-2,-9,-6;
MA /M: SY='Y';M=-5,-3,-6,-6,10,-7,-1,-1,-2,-1,-2,-3,-6,-5,-5,-4,-4,-4,-1,11;
MA /M: SY='K';M=-1,-6,1,1,-4,-2,0,-2,2,-3,-1,1,-1,1,1,0,0,-3,-7,-6;
MA /M: SY='A';M=1,-4,1,0,-5,1,-1,-1,0,-3,-1,1,0,0,0,1,1,-1,-7,-6;
MA /M: SY='R';M=0,-5,0,0,-5,-1,0,-1,1,-3,-1,1,0,1,1,0,0,-2,-5,-5;
MA /M: SY='R';M=0,-5,1,1,-6,0,1,-2,1,-4,-2,1,0,1,2,1,0,-2,-5,-5;
MA /M: SY='E';M=1,-6,2,2,-6,0,0,-2,-1,-4,-2,1,1,1,-1,0,0,-3,-8,-6;
MA /M: SY='D';M=0,-6,2,2,-6,0,1,-3,0,-5,-3,2,-1,2,-1,0,0,-4,-7,-4;
MA /M: SY='D';M=0,-8,4,3,-6,0,0,-2,-1,-3,-2,2,-2,2,-2,0,-1,-3,-9,-6;
MA /M: SY='L';M=-2,-8,-5,-5,2,-5,-3,3,-4,7,5,-4,-3,-3,-4,-3,-2,3,-4,-2;
MA /M: SY='S';M=1,-4,1,1,-5,1,0,-2,1,-4,-2,1,0,0,0,1,1,-2,-6,-5;
MA /M: SY='F';M=-3,-7,-6,-6,6,-5,-3,3,-2,5,3,-4,-5,-4,-5,-4,-3,1,-3,3;
MA /M: SY='Q';M=-1,-6,0,0,-3,-2,1,-1,1,-2,0,0,-1,1,1,-1,0,-1,-6,-4;
MA /M: SY='K';M=-1,-8,0,1,-3,-2,0,-2,3,-3,0,1,0,2,2,0,0,-3,-6,-6;
MA /M: SY='G';M=2,-5,1,0,-7,7,-3,-4,-2,-6,-4,1,-1,-2,-4,2,0,-2,-10,-8;
MA /M: SY='D';M=1,-7,5,4,-8,1,1,-3,0,-5,-3,2,-1,2,-2,0,0,-4,-10,-6;
MA /M: SY='I';M=0,-5,-1,-2,-2,-2,-1,2,0,0,1,-1,-2,0,0,-1,0,1,-6,-5;
MA /M: SY='L';M=-2,-6,-5,-5,3,-5,-3,4,-3,6,4,-4,-4,-3,-4,-3,-2,3,-5,0;
MA /M: SY='Q';M=-1,-5,-1,-1,-3,-2,0,0,0,-2,-1,0,-1,0,0,-1,0,-1,-6,-3;
MA /M: SY='V';M=0,-4,-3,-4,-1,-3,-3,5,-3,3,3,-2,-2,-2,-3,-2,0,5,-8,-4;
MA /M: SY='L';M=-1,-6,-3,-3,-1,-3,-2,2,-3,3,2,-2,-2,-2,-3,-2,-1,2,-5,-3;
MA /M: SY='D';M=0,-6,3,3,-6,0,1,-3,2,-5,-2,2,-1,2,1,0,0,-4,-7,-5;
MA /M: SY='K';M=-1,-6,0,0,-2,-1,0,-3,3,-4,-1,1,-1,0,1,0,0,-3,-6,-4;
MA /M: SY='N';M=1,-4,1,1,-5,0,0,-2,0,-3,-2,1,1,0,-1,1,1,-1,-7,-5;
MA /I: MI=0; I=-1; MD=0; /M SY='X'; M=0; D=-1;
MA /M: SY='G';M=1,-5,0,0,-5,1,-2,-1,-2,-3,-2,0,0,-1,-2,0,0,-1,-8,-6;
MA /M: SY='G';M=1,-6,3,3,-7,3,0,-4,-1,-5,-4,2,-1,1,-2,1,0,-3,-10,-6;
MA /M: SY='W';M=-9,-12,-9,-11,1,-11,-4,-8,-5,-3,-6,-6,-8,-7,3,-4,-8,-9,26,0;
MA /M: SY='W';M=-7,-9,-9,-9,0,-9,-4,-5,-5,-1,-4,-6,-7,-6,2,-3,-6,-6,18,-1;
MA /M: SY='K';M=-1,-7,0,0,-3,-2,0,-2,2,-3,-1,1,-1,1,2,0,-1,-3,-5,-5;
MA /M: SY='G';M=2,-3,0,-1,-6,3,-3,-2,-3,-4,-3,0,0,-2,-3,1,0,0,-10,-6;
MA /M: SY='Q';M=-2,-6,0,0,-3,-3,1,-2,0,-2,-1,0,-2,1,1,-1,-1,-3,-5,-3;
MA /I: MI=0; I=-2; MD=0; /M SY='X'; M=0; D=-2;
MA /M: SY='T';M=0,-4,-1,-1,-4,0,-2,0,-1,-2,0,0,-1,-1,-1,0,1,0,-7,-5;
MA /M: SY='T';M=0,-5,0,0,-3,-1,-1,-1,1,-3,-1,1,-1,0,0,1,1,-1,-6,-4;
MA /M: SY='G';M=0,-5,0,-1,-5,3,-2,-3,-1,-5,-3,0,-1,-1,-1,1,0,-2,-7,-6;
MA /M: SY='K';M=0,-6,1,1,-5,-1,1,-2,2,-4,-1,1,-1,2,2,0,0,-3,-6,-6;
MA /M: SY='R';M=-1,-6,-1,-1,-5,-3,1,-1,1,-3,-1,0,-1,1,3,-1,-1,-2,-2,-6;
MA /M: SY='G';M=1,-5,0,0,-6,6,-3,-3,-3,-5,-4,0,-1,-2,-4,1,0,-2,-10,-6;
MA /M: SY='W';M=-5,-5,-5,-5,2,-6,-2,-2,-4,-1,-3,-3,-6,-5,-3,-3,-4,-4,4,3;
MA /M: SY='F';M=-3,-5,-6,-6,6,-5,-3,4,-1,3,2,-4,-4,-5,-4,-3,-2,2,-4,3;
MA /M: SY='P';M=2,-4,-1,-1,-7,-1,0,-3,-2,-4,-3,-1,8,0,0,1,0,-2,-8,-7;
MA /M: SY='G';M=1,-3,0,0,-4,2,-1,-2,0,-3,-2,0,0,-1,-1,1,1,-1,-6,-5;
MA /M: SY='N';M=1,-5,2,1,-5,0,1,-2,1,-4,-2,2,0,0,0,1,1,-2,-7,-4;
MA /M: SY='Y';M=-5,-1,-7,-7,10,-8,-1,-1,-5,-1,-3,-3,-7,-6,-6,-4,-4,-5,0,13;
MA /M: SY='V';M=0,-3,-3,-5,-2,-2,-3,5,-3,2,2,-2,-2,-3,-4,-1,0,5,-8,-5;
MA /M: SY='E';M=1,-6,2,3,-6,0,0,-2,1,-4,-2,1,0,2,0,0,0,-3,-8,-6;
MA /M: SY='P';M=0,-5,-1,-1,-2,-2,-1,-2,-1,-3,-2,0,1,-1,-2,0,-1,-2,-6,-3;
//
<PAGE>
WE NEED YOUR HELP !
We welcome feedback from our users. We would especially appreciate that
you notify us if you find that sequences belonging to your field of
expertise are missing from the data bank. We also would like to be
notified about annotations to be updated, if, for example, the function
of a protein has been clarified or if new post-translational information
has become available.
<PAGE>
APPENDIX A: SOME STATISTICS
A.1 Amino acid composition
A.1.1 Composition in percent for the complete data bank
Ala (A) 7.62 Gln (Q) 4.02 Leu (L) 9.20 Ser (S) 7.13
Arg (R) 5.23 Glu (E) 6.26 Lys (K) 5.83 Thr (T) 5.82
Asn (N) 4.47 Gly (G) 6.98 Met (M) 2.36 Trp (W) 1.29
Asp (D) 5.28 His (H) 2.25 Phe (F) 4.01 Tyr (Y) 3.22
Cys (C) 1.77 Ile (I) 5.59 Pro (P) 5.01 Val (V) 6.52
Asx (B) 0.005 Glx (Z) 0.005 Xaa (X) 0.02
A.1.2 Classification of the amino acids by their frequency
Leu, Ala, Ser, Gly, Val, Glu, Lys, Thr, Ile, Asp, Arg, Pro, Asn, Gln,
Phe, Tyr, Met, His, Cys, Trp
A.2 Repartition of the sequences by their organism of origin
Total number of species represented in this release of SWISS-PROT: 4283
A.2.1 Table of the frequency of occurrence of species
Species represented 1x: 1939
2x: 703
3x: 398
4x: 250
5x: 179
6x: 160
7x: 94
8x: 68
9x: 87
10x: 47
11- 20x: 170
21- 50x: 111
51-100x: 36
>100x: 41
<PAGE>
A.2.2 Table of the most represented species
Number Frequency Species
1 2663 Human
2 2555 Escherichia coli
3 1731 Baker's yeast (Saccharomyces cerevisiae)
4 1578 Mouse
5 1453 Rat
6 679 Bovine
7 672 Caenorhabditis elegans
8 600 Fruit fly (Drosophila melanogaster)
9 529 Bacillus subtilis
10 515 Chicken
11 380 African clawed frog (Xenopus laevis)
12 367 Rabbit
13 352 Salmonella typhimurium
14 327 Pig
15 251 Vaccinia virus (strain Copenhagen)
16 236 Maize
17 211 Arabidopsis thaliana (Mouse-ear cress)
18 200 Bacteriophage T4
19 193 Human cytomegalovirus (strain AD169)
20 183 Slime mold (Dictyostelium discoideum)
183 Vaccinia virus (strain WR)
22 182 Rice
23 176 Pseudomonas aeruginosa
24 170 Tobacco
25 169 Pea
26 165 Wheat
27 164 Fission yeast (Schizosaccharomyces pombe)
28 149 Barley
29 146 Variola virus
30 138 Dog
31 136 Soybean
32 131 Staphylococcus aureus
131 Sheep
34 129 Spinach
35 122 Neurospora crassa
36 120 Marchantia polymorpha (Liverwort)
37 118 Rhodobacter capsulatus
38 116 Pseudomonas putida
39 112 Klebsiella pneumoniae
40 108 Agrobacterium tumefaciens
<PAGE>
A.3 Repartition of the sequences by size
From To Number From To Number
1- 50 2081 1001-1100 342
51- 100 3595 1101-1200 227
101- 150 4999 1201-1300 176
151- 200 3459 1301-1400 103
201- 250 3057 1401-1500 102
251- 300 2681 1501-1600 51
301- 350 2516 1601-1700 49
351- 400 2572 1701-1800 40
401- 450 1943 1801-1900 43
451- 500 2065 1901-2000 31
501- 550 1413 2001-2100 17
551- 600 1014 2101-2200 43
601- 650 714 2201-2300 50
651- 700 533 2301-2400 18
701- 750 514 2401-2500 22
751- 800 393 >2500 114
801- 850 305
851- 900 319
901- 950 199
951-1000 200
Currently the ten longest sequences are:
HTS1_COCCA 5217 a.a.
FAT_DROME 5147 a.a.
RYNR_RABIT 5037 a.a.
RYNR_HUMAN 5032 a.a.
RYNC_RABIT 4969 a.a.
DYHC_DICDI 4725 a.a.
APB_HUMAN 4563 a.a.
APOA_HUMAN 4548 a.a.
RRPA_CVMJH 4488 a.a.
DYHC_TRIGR 4466 a.a.
<PAGE>
APPENDIX B: ON-LINE EXPERTS
B.1 List of on-line experts for PROSITE and SWISS-PROT
Field of expertise Name Email address
--------------------------- ------------------ ----------------------------
Alcohol dehydrogenases Joernvall H. hans.jornvall@k1m.ki.se
Persson B. bengt.persson@embl-heidelberg.de
Aldehyde dehydrogenases Joernvall H. hans.jornvall@k1m.ki.se
Persson B. bengt.persson@embl-heidelberg.de
Alpha-crystallins/HSP-20 Leunissen J.A.M. jackl@caos.caos.kun.nl
de Jong W. u629000@hnykun11.bitnet
Alpha-2-macroglobulins Van Leuven F. fred@blekul13.bitnet
AA-tRNA synthetases class II Leberman R. leberman@frembl51.bitnet
Apolipoproteins Boguski M.S. boguski@ncbi.nlm.nih.gov
AraC family HTH proteins Ramos J.L. jlramos@cnbvx3.cnb.uam.es
Arrestins Kolakowski L.F.Jr. lfk@receptor.mgh.harvard.edu
Asparaginase / glutaminase Gribskov M. gribskov@sdsc.edu
ATP synthase c subunit Recipon H. recipon@ncbi.nlm.nih.gov
Band 4.1 family proteins Rees J. jrees@vax.oxford.ac.uk
Beta-lactamases Brannigan J. jab5@vaxa.york.ac.uk
Beta-transducin family Boguski M.S. boguski@ncbi.nlm.nih.gov
C-type lectin domain Drickamer K. drick@cuhhca.hhmi.columbia.edu
Chalcone/stilbene synthases Schroeder J. raf@sun1.ruf.uni-freiburg.de
Chaperonins cpn10/cpn60 Georgopoulos C. georgopo@cmu.unige.ch
Chaperonins TCP1 family Willison K.R. willison@icr.ac.uk
Chitinases Henrissat B. bernie@cermav.grenet.fr
Clusterin Peitsch M.C. mcp13936@ggr.co.uk
Cold shock domain Landsman D. landsman@ncbi.nlm.nih.gov
CTF/NF-I Mermod N. nmermod@ulys.unil.ch
Gronostajski R. gronosr@ccsmtp.ccf.org
Cytochromes P450 Holsztynska E.J. ela@netcom.uucp
netcom!ela@apple.com
DEAD-box helicases Linder P. linder@urz.unibas.ch
Deoxyribonuclease I Peitsch M.C. mcp13936@ggr.co.uk
dnaJ family Kelley W. kelley@cmu.unige.ch
EF-hand calcium-binding Cox J.A. cox@sc2a.unige.ch
Kretsinger R.H. rhk5i@virginia.bitnet
Elongation factor 1 Amons R. wmbamons@rulgl.leidenuniv.nl
Enoyl-CoA hydratase Hofmann K.O. khofmann@biomed.biolan.uni-koeln.de
Fatty acid desaturases Piffanelli P. piffanelli@jii.afrc.ac.uk
fruR/lacI family HTH proteins Reizer J. jreizer@ucsd.edu
GATA-type zinc-fingers Boguski M.S. boguski@ncbi.nlm.nih.gov
GDT/GTP dissociation stimul. Boguski M.S. boguski@ncbi.nlm.nih.gov
GltP family of transporters Hofmann K.O. khofmann@biomed.biolan.uni-koeln.de
Glucanases Henrissat B. bernie@cermav.grenet.fr
Beguin P. phycel@pasteur.bitnet
Glutamine synthetase Tateno Y. ytateno@genes.nig.ac.jp
G-protein coupled receptors Chollet A. arc3029@ggr.co.uk
Attwood T.K. bph6tka@biovax.leeds.ac.uk
Kolakowski L.F.Jr. lfk@receptor.mgh.harvard.edu
<PAGE>
GTPase-activating proteins Boguski M.S. boguski@ncbi.nlm.nih.gov
HIT family Seraphin B. seraphin@embl-heidelberg.de
HMG1/2 and HMG-14/17 Landsman D. landsman@ncbi.nlm.nih.gov
Inorganic pyrophosphatases Kolakowski L.F.Jr. lfk@receptor.mgh.harvard.edu
Integrases Roy P.H. 2020000@saphir.ulaval.ca
Kringle domain Ikeo K. kikeo@genes.nig.ac.jp
Lipocalins Boguski M.S. boguski@ncbi.nlm.nih.gov
Peitsch M.C. mcp13936@ggr.co.uk
lysR family HTH proteins Henikoff S. henikoff@sparky.fhcrc.org
MAC components / perforin Peitsch M.C. mcp13936@ggr.co.uk
Malic enzymes Glynias M. mglynias@ncsa.uiuc.edu
MAM domain Bork P. bork@embl-heidelberg.de
MIP family proteins Reizer J. jreizer@ucsd.edu
Myelin proteolipid protein Hofmann K.O. khofmann@biomed.biolan.uni-koeln.de
Pancreatic trypsin inhibitor Ikeo K. kikeo@genes.nig.ac.jp
PEP requiring enzymes Reizer J. jreizer@ucsd.edu
pfkB carbohydrate kinases Reizer J. jreizer@ucsd.edu
Phosphomannose isomerases Proudfoot A.E.I. aep6830@ggr.co.uk
Phytochromes Partis M.D. partis@afrc.ac.uk
Plant viruses icosahedral Koonin E.V. koonin@ncbi.nlm.nih.gov
capsid proteins
Protein kinases Quinn A.M. quinn@biomed.med.yale.edu
Hunter T. hunter@salk-sc2.sdsc.edu
PTS proteins Reizer J. jreizer@ucsd.edu
Restriction-modification Bickle T. bickle@urz.unibas.ch
enzymes Roberts R.J. roberts@neb.com
Ribosomal protein S15 Ellis S.R. srelli01@ulkyvm.bitnet
Ring-cleavage dioxygenases Harayama S. sharayam@ddbj.nig.ac.jp
Signal sequence peptidases von Heijne G. gvh@csb.ki.se
Dalbey R.E. rdalbey@magnus.acs.ohio-state.edu
Sodium symporters Reizer J. jreizer@ucsd.edu
Subtilases Brannigan J. jab5@vaxa.york.ac.uk
Siezen R.J. nizo@caos.caos.kun.nl
Thiol proteases Turk B. turk@ijs.ac.mail.yu
Thiol proteases inhibitors Turk B. turk@ijs.ac.mail.yu
TNF family Jongeneel C.V. vjongene@isrecmail.unil.ch
TPR repeats Boguski M.S. boguski@ncbi.nlm.nih.gov
Transit peptides von Heijne G. gvh@csb.ki.se
Type-II membrane antigens Levy S. levy@cellbio.stanford.edu
Uracil-DNA glycosylase Aasland R. aasland@embl-heidelberg.de
Vitamin K-depend. Gla domain Price P.A. pprice@ucsd.edu
XPGC protein Clarkson S.G. clarkson@cmu.unige.ch
Xylose isomerase Jenkins J. jenkins@frira.afrc.ac.uk
WAP-type domain Claverie J.-M. jmc@ncbi.nlm.nih.gov
ZP domain Bork P. bork@embl-heidelberg.de
African swine fever virus Yanez R.J. ryanez@cbm2.uam.es
Bacteriophage P4 Halling C. chh9@midway.uchicago.edu
Caenorhabditis elegans Sonnhammer E. esr@mrc-molecular.biology.cam.ac.uk
Chloroplast encoded proteins Hallick R.B. hallick@arizona.edu
Dictyostelium discoideum Smith D.W. dsmith@ucsd.edu
Drosophila Ashburner M. ma11@phx.cam.ac.uk
Escherichia coli Rudd K. rudd@ncbi.nlm.nih.gov
Salmonella typhimurium Rudd K. rudd@ncbi.nlm.nih.gov
Snakes Stocklin R. stocklin@cmu.unige.ch
<PAGE>
B.2 Requirements to fulfill to become an on-line expert
An expert should be a scientist working with specific famili(es) of
proteins (or specific domains) and who would:
a) Review the protein sequences in SWISS-PROT and the patterns/matrices
in PROSITE relevant to their field of research.
b) Agree to be contacted by people that have obtained new sequence(s)
which seem to belong to "their" familie(s) of proteins.
c) Have access to electronic mail and be willing to use it to send and
receive data.
If you are willing to be part of this scheme please contact Amos Bairoch
at the following electronic mail address:
bairoch@cmu.unige.ch
<PAGE>
APPENDIX C: RELATIONSHIPS BETWEEN BIOMOLECULAR DATABASES
The current status of the relationships (cross-references) between some
biomolecular databases is shown in the following schematic:
**********************
*********************** * EPD [Euk. Promot.] *
* EMBL Nucleotide * <----> **********************
* Sequence Data *
****************** * Library * **********************
* FLYBASE * <--> *********************** <----- * ECD [E. coli map] *
* [Drosophila * ^ ^ **********************
* genomic d.b.] * <--------+ | |
****************** | | +------------ **********************
| | * TFD [Trans. fact.] *
| | +-----------> **********************
****************** | | |
* WormPep * | | | **********************
* [C.elegans] * <----+ | | | +------> * DictyDB [D.disco.] *
****************** | | | | | **********************
| | | | |
****************** | v v v v **********************
* REBASE * *********************** * ENZYME [Nomencl.] *
* [Restriction * <--- * SWISS-PROT * <----- **********************
* enzymes] * * Protein Sequence * |
****************** * Data Bank * v
*********************** **********************
****************** ^ ^ | | ^ ^ | * OMIM [Diseases] *
* EcoGene/EcoSeq * | | | | | | +--------> **********************
* [E. coli] * <-----+ | | | | |
****************** | | | | +-----------> **********************
| | | | * ECO2DBASE [2D] *
| | | | **********************
****************** | | | |
* PROSITE * <--------+ | | +---------------> **********************
* [Patterns] * | | * SWISS-2DPAGE [2D] *
****************** | +---------------+ **********************
| v |
| *********************** | **********************
+--------> * PDB [3D structures] * +--> * Aarhus/Ghent [2D] *
*********************** **********************
<PAGE>